Let’s simulate a random sample from a distribution with density \(f(x) = \frac{3}{32}x^5, \ \ 0<x<2\). Integrating gives the CDF \(F_X(x) = \frac{x^6}{64}\), whose inverse is \(x= F^{-1}_X(u) = 2u^{\frac{1}{6}}\). By the probability integral transform, if \(U \sim \text{Uniform}[0,1]\) then \(F^{-1}_X(U)\) has density \(f\). We can now sample values of \(x\):
n <- 1000
u <- runif(n) #uniform sample
x <- 2*u^(1/6) #apply the inverse CDF
hist(x, probability = TRUE, main = expression(f(x)==textstyle(frac(3,32))*x^5))
y <- seq(0, 2, .01)
lines(y, 3/32 * y^5) #overlay the true density
How did we get this uniform random variable? There is a method to the madness.
Let’s take a look at our CDF:
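We can draw it with a one-liner (shown here for reference, using the CDF derived above):

curve(x^6/64, from = 0, to = 2, xlab = "x", ylab = expression(F(x))) #CDF on [0, 2]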
Geometrically, we have a smooth mapping that is contracting the interval \([0,2]\) to the interval \([0,1]\). However, this is not being done evenly. Let’s bin the x-axis into intervals of width \(.25\) and compute the direct image of each bin.
bin | image | img_length |
---|---|---|
[0 , 0.25] | [0 , 0] | 0.000 |
[0.25 , 0.5] | [0 , 0] | 0.000 |
[0.5 , 0.75] | [0 , 0.003] | 0.003 |
[0.75 , 1] | [0.003 , 0.016] | 0.013 |
[1 , 1.25] | [0.016 , 0.06] | 0.044 |
[1.25 , 1.5] | [0.06 , 0.178] | 0.118 |
[1.5 , 1.75] | [0.178 , 0.449] | 0.271 |
[1.75 , 2] | [0.449 , 1] | 0.551 |
We partition the domain into bins of equal length, which, under the image of the CDF, partitions the interval \([0,1]\) into bins of very unequal length. If we then sample uniformly from the interval \([0,1]\) and return each point to its original bin (i.e., apply the inverse), we end up with our original distribution.
The length of each image is exactly the probability that a point sampled uniformly from \([0,1]\) has its inverse fall into the corresponding bin.
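For reference, here is one way to compute the table above (a sketch; the helper names are mine):

Fcdf <- function(x){x^6/64} #the CDF from above
breaks <- seq(0, 2, by = 0.25)
Fvals <- round(Fcdf(breaks), 3)
data.frame(bin = paste0("[", head(breaks, -1), ", ", tail(breaks, -1), "]"),
image = paste0("[", head(Fvals, -1), ", ", tail(Fvals, -1), "]"),
img_length = round(diff(Fcdf(breaks)), 3))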
Suppose we are told that a coin is flipped 50 times, 10 of which land heads. We want to simulate a sample of possible values for the bias of the coin. We can model our belief about the bias as a beta distribution with parameters \(a = 10\) and \(b = 40\), with density \[f(x) = \frac{49!}{9!39!}x^9(1-x)^{39}\]
We will use the uniform distribution as our \(g(x)\), since it has the same domain and is easy to sample from. Now we solve for the threshold parameter \(c\), which satisfies \[ \forall x \in [0,1]: \hspace{2mm} \frac{f(x)}{g(x)} = \frac{49!}{9!39!}x^9(1-x)^{39} \leq c\] To make \(c\) as small as possible, a graph of the density reveals we should evaluate \(f\) at the mode of the distribution, which has a convenient formula: \[\text{mode} = \frac{a-1}{a+b-2}\]
The mode of the beta distribution is \(0.1875\), and we calculate that \(c = 7.159\). It follows that a random \(x\) drawn from \(g(x)\) is accepted if \[\frac{f(x)}{cg(x)} = \frac{\frac{49!}{9!39!}x^9(1-x)^{39}}{7.159} > u\] for a random uniform \(u\).
n <- 1000 #target sample size
k <- 0 #counter for accepted
j <- 0 #iterations
y <- numeric(n)
beta_density <- function(x){dbeta(x, shape1 = 10, shape2 = 40)} #target density f
my_c <- dbeta(0.1875, 10, 40) #c = f(mode) = 7.159
acceptance_ratio <- function(x){beta_density(x)/my_c}
while(k < n){
  acceptance_threshold <- runif(1)
  j <- j + 1
  x <- runif(1) #random variate from g
  if(acceptance_ratio(x) > acceptance_threshold){
    #accept x
    k <- k + 1
    y[k] <- x
  }
}
In this simulation, j = 7177 iterations were required to generate a sample of n = 1000 variates, compared to the expected \(cn =\) 7159 iterations. Comparing the empirical and theoretical deciles confirms that our sample fits the beta distribution.
 | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% |
---|---|---|---|---|---|---|---|---|---|
Qsim | 0.1315288 | 0.1525328 | 0.1681874 | 0.1816580 | 0.1967094 | 0.2097532 | 0.2256079 | 0.2427806 | 0.2769144 |
Q | 0.1308707 | 0.1515751 | 0.1675891 | 0.1819719 | 0.1959798 | 0.2105138 | 0.2266215 | 0.2461659 | 0.2744112 |
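The comparison above can be reproduced along these lines (a sketch, assuming y holds the accepted sample from the code above):

probs <- seq(.1, .9, .1)
Qsim <- quantile(y, probs) #empirical deciles of the simulated sample
Q <- qbeta(probs, 10, 40) #theoretical Beta(10, 40) deciles
knitr::kable(rbind(Qsim, Q))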
Let \(X_1, \dots, X_n\) be an i.i.d. sequence of random variables such that \(X_j \sim X\). The distribution of their sum \(S = X_1 + \dots + X_n\) is called the n-fold convolution of \(X\) and is denoted \(F^{*(n)}_X\). For example, the chi-squared distribution with \(\nu\) degrees of freedom is the \(\nu\)-fold convolution of squared standard normals. Let’s use this fact to simulate a random sample of size \(n\) from a \(\chi^2_{\nu}\) distribution.
library(tidyverse)
n <- 1000
nu <- 6
X <- matrix(rnorm(n*nu), n, nu)^2
y <- rowSums(X)
comp <- data.frame(
theoretical = c(nu, 2*nu),
empirical = c(mean(y), mean(y^2) - (mean(y))^2),
row.names = c("mean", "variance")
)
knitr::kable(comp)
 | theoretical | empirical |
---|---|---|
mean | 6 | 6.25575 |
variance | 12 | 12.54616 |
We have accurately generated a chi-squared random sample.
A discrete mixture of random variables has a distribution \[F_X = \sum_{j=1}^n\theta_j F_{X_j}\] where our weights \(\theta_j\) sum to 1 and \(X_1 \dots X_n\) is any sequence of random variables.
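As a warm-up, here is a minimal sketch of sampling from a two-component mixture (an assumed example, not the gamma mixture investigated below): first draw the component label with probabilities \(\theta_j\), then draw from the selected component.

N <- 10^4
theta <- c(0.3, 0.7) #mixing weights
component <- sample(1:2, size = N, replace = TRUE, prob = theta)
means <- c(0, 4)
y <- rnorm(N, mean = means[component], sd = 1) #0.3*N(0,1) + 0.7*N(4,1)
hist(y, breaks = 50, probability = TRUE)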
Here I will investigate the behavior of a specific mixture of gamma random variables given by \[X_j \sim \text{Gamma}(r,\ \lambda_j = \frac{1}{j})\] \[\theta_j = \frac{2j}{n(n+1)}\] \[1\leq j \leq n\]. Our mixed random variable is \(Y_n\) with distribution \[F_{Y_n} = \sum_{j=1}^n\theta_j F_{X_j}\]
I will simulate from mixtures of different sizes. Notice that in our mixture, the contribution of variables appearing later in the sequence is much higher than that of the earlier ones. I hypothesize that as \(n\) increases, \(F_{Y_n} \rightarrow F_{X_n}\).
library(tidyverse)
N <- 1000 # sample size
n <- 50
r <- 3
gamma_mixture <- function(n = 50, r = 3, N = 1000){
  # n is the number of variables we are mixing
  # N is the sample size we are generating
  index <- 1:n
  weights <- index/((n*(n+1))/2) #theta_j = 2j/(n(n+1))
  index_sample <- sample(index, size = N, replace = TRUE, prob = weights)
  lambda <- 1/index_sample #rate parameter for each draw
  mixture_sample <- rgamma(N, shape = r, rate = lambda)
  mixture_sample
}
sizes <- c(50, 200, 1000)
sizes %>%
map(function(x) gamma_mixture(n = x)) ->
mixture_list
sizes %>%
map(function(x) rgamma(N, shape = r, rate = 1/x)) ->
gamma_list
df <- tibble(
data = c(mixture_list, gamma_list),
n_size = rep(sizes, times = 2),
distribution = factor(rep(c("mixture", "gamma"), each = 3), levels = c("mixture", "gamma"))
)
library(magrittr)
library(scales)
df %<>% unnest(cols = data) #expand the samples stored in the data list-column
ggplot(data = df, aes(x = data)) +
geom_density(position = "stack", aes(color = distribution)) +
scale_x_continuous(breaks = scales::pretty_breaks(n = 3)) +
facet_grid(. ~n_size, scales = "free", labeller = label_bquote(cols = n == .(n_size)))
The graphs support our hypothesis that the similarity between the distributions of \(Y_n\) and \(X_n\) gets stronger as \(n\) increases. You can see the sharp peak of the mixture distribution rapidly flattens out to match the gamma distribution. Note the variance of a gamma random variable \(X_n\) with rate parameter \(\frac{1}{n}\) and shape parameter \(r\) is \[ \text{Var}[X_n] = n^2r\]
so the variance grows quadratically with our choice of \(n\), flattening out the graph. In contrast, our gamma mixture takes contributions from all \(X_j,\ 1\leq j\leq n\), so it will have less variance and thus a sharper peak.
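A quick numeric check of this claim (a sketch, reusing gamma_mixture from above):

#sample variance of the mixture vs. the single gamma X_n, with n = 1000
var(gamma_mixture(n = 1000, r = 3, N = 10^4)) #roughly 2e6: smaller than the gamma
var(rgamma(10^4, shape = 3, rate = 1/1000)) #roughly n^2 * r = 3e6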
See:
Rizzo: Statistical Computing With R (2007) Chapter 3: Methods of Generating Random Variables
We can use random number generation to estimate the integral of a function that would be difficult to solve analytically. In this case the quantity we want to estimate is the value of an integral \(\theta = \int_{\chi}g(x)dx\). The connection between generic integrals and random variables is given by the definition of the expectation of a transformed random variable. Recall that if \(X\) is a random variable with density \(f(x)\), then the mathematical expectation of the transformed random variable \(g(X)\) is \[ \mathbb{E}_f[g(X)] = \int_{\chi}g(x)f(x)dx\] where \(\chi\) is the support of the density \(f\).

A simple case occurs when integrating a function \(g(x)\) over \([0,1]\), because we can choose \(X \sim \text{Uniform}[0,1]\): \[\theta = E[g(X)] = \int_0^1g(x)(1)dx = \int_0^1g(x)dx\] We can then generate a random sample of uniforms \(X_1, \dots, X_m\) to get the Monte Carlo estimate of \(\theta\) using the sample mean: \[\widehat{\theta}_{MC} = \widehat{\mathbb{E}_f[g(X)]} = \frac{1}{m}\sum_{i=1}^mg(X_i)\] Since \(X_1, \dots, X_m\) is an i.i.d. sample, the strong law of large numbers guarantees that the sample mean converges to the true expected value: \[P\bigg(\lim_{m\rightarrow \infty} \frac{1}{m}\sum_{i=1}^mg(X_i) = E[g(X)] \bigg) = 1\] So we can expect that our estimate will approximate \(\theta\) given a large enough sample. Let’s use this to estimate \[\theta = \int_0^1 e^{-x}dx\] and compare the estimate with the exact value.
m <- 10000
x <- runif(m)
theta.hat <- round(mean(exp(-x)),6)
soln <- round(1-exp(-1),6)
st_err <- format((1/m)*sqrt(sum((exp(-x)-theta.hat)^2)), digits = 3) #sigma_hat/sqrt(m)
The estimate is \(\widehat{\theta}=\) 0.631785 and \(\theta = 1 -e^{-1}\) = 0.632121.
We can quantify the accuracy of our estimate. Note that in statistical parlance \(\theta\) is a parameter we want to estimate, and our estimator is a sample statistic \[\widehat{\theta}_m = T(g(X_1), \dots g(X_m))\] The population distribution of our sample statistic is called the sampling distribution. We want to find the standard deviation of the sampling distribution, called the standard error and denoted \(SE(\widehat{\theta}_m)\). Since our estimator is a sample mean from the distribution of \(g(X)\), we can estimate the standard error with \[SE(\widehat{\theta}_m) = \frac{\widehat{\sigma}}{\sqrt{m}} = \frac{1}{m}\left\{\sum_{i=1}^m\left[g(x_i)-\overline{g(x)}\right]^2\right\}^{\frac{1}{2}}\]
Using this formula we calculate \(SE(\widehat{\theta}_{10^4}) \approx 0.0018\) for the previous example.
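As a cross-check (a sketch), the theoretical standard error follows from the closed-form variance of \(e^{-X}\) for \(X \sim \text{Uniform}[0,1]\):

v <- (1 - exp(-2))/2 - (1 - exp(-1))^2 #Var(e^-X) = E[e^-2X] - (E[e^-X])^2
sqrt(v/m) #approximately 0.0018, agreeing with the estimate above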
We need not limit ourselves to uniform random variables for our Monte Carlo estimator. Suppose we want to estimate the quantity \(\theta = \int_{\chi}g(x)dx\). We can look for a random variable \(X\) with pdf \(f\) that is supported on \(\chi\) and easy to sample from. Then we define the transformation \(h(X) = \frac{g(X)}{f(X)}\). Taking the expectation gives us \[ \mathbb{E}_f[h(X)] = \int_{\chi}\frac{g(x)}{f(x)}f(x)dx = \int_{\chi}g(x)dx\] We can get a Monte Carlo estimate of \(\theta\) using a sample of random variables \(X_1, \dots, X_m\) generated from \(f\), transforming them by \(h\), and taking the mean: \[\widehat{\theta}_{MC} = \widehat{\mathbb{E}_f[h(X)]} = \frac{1}{m}\sum_{i=1}^m\frac{g(X_i)}{f(X_i)}\] Note that the example in (2.1) is a special case where \(f(X_i) = 1\) because we are using uniform random variables. In this example we will consider a function defined on an infinite domain, so we cannot choose any uniform pdf for our \(f\). Consider \[\int_0^{\infty}\sin(x^2)e^{-x^2}dx\] If we choose \(X \sim \text{Exp}(\lambda = 1)\) then \(f(x) = e^{-x}\) on \(x\geq 0\) and \[h(x) = \frac{\sin(x^2)e^{-x^2}}{e^{-x}} = \sin(x^2)e^{-x^2 + x}\]
m <- 10^5
sample <- rexp(m, rate = 1)
h <- function(x){
sin(x^2)*exp(-x^2 + x)
}
estimate <- sum(h(sample))/m
Our Monte Carlo estimate \(\widehat{\theta}_{MC}\) = 0.2853661 approximates the true value determined by Mathematica:
\[\int_0^{\infty} \sin(x^2)\,e^{-x^2}\,dx = \frac{\sqrt{\pi}\,\sin\left(\frac{\pi}{8}\right)}{2\cdot 2^{1/4}} \approx 0.285185\]
See:
Chihara, Hesterberg: Mathematical Statistics with Resampling and R (2012) Chapter 11.6: Monte Carlo Integration
Given a density \(g\) that is strictly positive on the support of \(h \cdot f\), we can write \[\mathbb{E}_f[h(X)] = \int_{\chi}h(x)\frac{f(x)}{g(x)}g(x)dx = \mathbb{E}_g\left[\frac{h(X)f(X)}{g(X)}\right]\]
Instead of sampling from the pdf \(f\), we can sample from a pdf \(g\). This is the importance sampling fundamental identity, and it justifies the estimator \[\widehat{\mathbb{E}_f[h(X)]} = \frac{1}{m}\sum_{j=1}^m\frac{h(X_j)f(X_j)}{g(X_j)}\] based on a sample generated from \(g\).
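As a quick illustration (an assumed example, not from the text), importance sampling shines when \(f\) rarely produces draws where \(h\) matters. To estimate the tail probability \(P(Z > 4.5)\) for \(Z \sim \mathcal{N}(0,1)\), plain Monte Carlo with rnorm almost never lands in the tail, but a shifted Exp(1) proposal \(g(y) = e^{-(y-4.5)}\) lives exactly there:

m <- 10^4
y <- rexp(m) + 4.5 #draws from g, supported on (4.5, Inf)
w <- dnorm(y)/dexp(y - 4.5) #importance weights f(y)/g(y); here h = 1
mean(w) #approximately 3.4e-06; compare pnorm(-4.5) = 3.397673e-06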
Rick and Morty have just returned from an off-world adventure on alien exoplanet DW7449, where they were retrieving slime samples from the Type III Lazy Slug, a possible candidate for a new flavor of Szechuan Sticky Sauce. Two important flavor variables they need to measure are the slime’s schleeb and its glubis, which are known to be normally distributed. While taking measurements in Rick’s workshop, Morty slipped on a slime patch and crashed into the samples, corrupting the data before they could finish recording the glubis. Fortunately, Rick had already predicted that a typical Morty-related disaster would occur, and had preprogrammed an EM algorithm to rescue the lost data. Let’s investigate what he did.
library(ggthemes)
library(MASS)
set.seed(3)
n <- 100 #sample size
d <- 2 #dimension
mu <- c(1,14) #vector of means
SIGMA <- matrix(c(12,3,3,6), ncol = 2, byrow = TRUE)
slime_data <- data.frame(mvrnorm(n, mu, Sigma = SIGMA))
names(slime_data) = c("schleeb", "glubis")
lost_data <- sample(1:n, size = n/2)
slime_data[lost_data, 2] <- 0 #Set the lost data values to 0
slime_plot <- ggplot(slime_data, aes(schleeb, glubis)) +
geom_point(color = "yellow") +
theme_solarized(light = FALSE) +
theme(text = element_text(size=20, color = "green")) +
stat_smooth(method = "lm", color = "red", se = FALSE) +
coord_cartesian(ylim = c(-1,20))
The first step in the EM algorithm is to create an artificially complete data set by arbitrarily setting the missing glubis values to 0. We call this dataset \(slime^{(0)}\); it is plotted below. We don’t expect the regression line to be accurate.
slime_plot
We know the slime data comes from a bivariate normal distribution with parameters \(\theta = (\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)\). To get maximum likelihood estimates for \(\mu_2, \sigma_2\), and \(\rho\) we need to properly fill in the missing data for the glubis.
To get one iteration of the EM algorithm, we use maximum likelihood estimators to produce an estimate \(\widehat{\theta}^{(0)} = (\mu_1^{(0)}, \mu_2^{(0)}, \sigma_1^{(0)}, \sigma_2^{(0)}, \rho^{(0)})\). This is the M (“Maximization”) step.
#construct a function that computes MLE's
theta_hat_estimator <- function(schleeb = slime_data$schleeb, glubis = slime_data$glubis){
mu_1 <- mean(schleeb)
mu_2 <- mean(glubis)
sigma_1 <- sd(schleeb)
sigma_2 <- sd(glubis)
rho <- cor(schleeb,glubis)
return(c(mu_1, mu_2, sigma_1, sigma_2, rho))
}
We can now impute the missing values using the formula for the conditional expectation of a bivariate normal variable with \(\theta = \widehat{\theta}^{(0)}\). This is the E (“Expectation”) step: \[\mathbb{E}[x_{2i}|\widehat{\theta}^{(0)}] = \mu_2^{(0)} + \rho^{(0)}\frac{\sigma_2^{(0)}}{\sigma_1^{(0)}}(x_{1i} - \mu_1^{(0)}) \]
#construct a function to update the slime data from slime_j -> slime_j+1
impute_glubis <- function(x_1, mu_1, mu_2, sigma_1, sigma_2, rho){
mu_2 + rho*(sigma_2/sigma_1) * (x_1 - mu_1)
}
The E and M steps are repeated giving a new dataset \(slime^{(j)}\) at the \(j^{th}\) stage and updating the parameter estimate \(\widehat{\theta}^{(j)}\) until reaching some convergence threshold \(||\widehat{\theta}^{(j+1)} - \widehat{\theta}^{(j)}|| < \epsilon\).
#Run the EM algorithm
epsilon <- .000005 # convergence threshold
theta_hat_0 <- rep(0,5) #initialize temp variable
theta_hat_1 <- theta_hat_estimator(slime_data$schleeb, slime_data$glubis)
var_path <- c(0,theta_hat_1)
step = 1
condition <- function(){
#convergence condition
sqrt(sum((theta_hat_1 - theta_hat_0)^2)) > epsilon
}
while(condition()) {
#EM algorithm
slime_data[lost_data, 2] <- impute_glubis(slime_data[lost_data, 1], theta_hat_1[1], theta_hat_1[2], theta_hat_1[3], theta_hat_1[4], theta_hat_1[5])
theta_hat_0 = theta_hat_1
theta_hat_1 = theta_hat_estimator(slime_data$schleeb, slime_data$glubis)
var_path = rbind(var_path, c(step, theta_hat_1))
step = step + 1
}
knitr::kable(var_path, booktabs = F, align = c("c"),
col.names = c("step", "$\\mu_1$", "$\\mu_2$", "$\\sigma_1$", "$\\sigma_2$", "$\\rho$")
)
step | \(\mu_1\) | \(\mu_2\) | \(\sigma_1\) | \(\sigma_2\) | \(\rho\) |
---|---|---|---|---|---|
0 | 0.9787029 | 6.987215 | 3.081482 | 7.259808 | 0.1191515 |
1 | 0.9787029 | 10.440005 | 3.081482 | 4.038984 | 0.2198468 |
2 | 0.9787029 | 12.165318 | 3.081482 | 2.649032 | 0.2766403 |
3 | 0.9787029 | 13.035294 | 3.081482 | 2.121314 | 0.2766010 |
4 | 0.9787029 | 13.477175 | 3.081482 | 1.944324 | 0.2495371 |
5 | 0.9787029 | 13.702908 | 3.081482 | 1.887004 | 0.2238040 |
6 | 0.9787029 | 13.818741 | 3.081482 | 1.867507 | 0.2065273 |
7 | 0.9787029 | 13.878386 | 3.081482 | 1.860207 | 0.1962670 |
8 | 0.9787029 | 13.909180 | 3.081482 | 1.857164 | 0.1904945 |
9 | 0.9787029 | 13.925111 | 3.081482 | 1.855776 | 0.1873364 |
10 | 0.9787029 | 13.933366 | 3.081482 | 1.855101 | 0.1856364 |
11 | 0.9787029 | 13.937648 | 3.081482 | 1.854762 | 0.1847306 |
12 | 0.9787029 | 13.939871 | 3.081482 | 1.854587 | 0.1842511 |
13 | 0.9787029 | 13.941026 | 3.081482 | 1.854497 | 0.1839985 |
14 | 0.9787029 | 13.941626 | 3.081482 | 1.854449 | 0.1838658 |
15 | 0.9787029 | 13.941939 | 3.081482 | 1.854425 | 0.1837963 |
16 | 0.9787029 | 13.942101 | 3.081482 | 1.854412 | 0.1837600 |
17 | 0.9787029 | 13.942186 | 3.081482 | 1.854405 | 0.1837410 |
18 | 0.9787029 | 13.942229 | 3.081482 | 1.854402 | 0.1837310 |
19 | 0.9787029 | 13.942252 | 3.081482 | 1.854400 | 0.1837259 |
20 | 0.9787029 | 13.942264 | 3.081482 | 1.854399 | 0.1837232 |
21 | 0.9787029 | 13.942270 | 3.081482 | 1.854398 | 0.1837218 |
22 | 0.9787029 | 13.942274 | 3.081482 | 1.854398 | 0.1837210 |
The last line in our table corresponds to the final estimates of our parameters.
slime_plot %+% slime_data
This is a plot of Rick’s reconstructed data, \(slime^{(22)}\). Notice how the imputed data points lie along the regression line; since they were computed using conditional expectation, this is by design.
See:
Efron, Hastie: Computer Age Statistical Inference: Algorithms, Evidence, and Data Science (2016) Chapter 9: Survival Analysis and the EM Algorithm
Estimation can become complicated in higher dimensions. Given a high-dimensional probability density, we may not be able to find a proposal distribution that matches it well enough to draw i.i.d. samples that land where the target density concentrates. Instead, we can use dependent sampling, where each draw is steered toward an area of high probability in the density based on our knowledge of the previous draw. Sampling in this way produces a Markov chain. We can no longer rely on the law of large numbers for convergence, but ergodic theorems give similar asymptotic results for Markov chains. To be ergodic, the chain must have certain nice properties:
- Every state must be accessible by the Markov chain. Suppose we have a discrete Markov chain \(X = (X_0, X_1, \dots )\) with state space \(S = \{1, \dots, m\}\). Starting from any state, we want a non-zero probability of eventually reaching any other state; that is, the transition probability \(P_{ij} = P(X_{n+1} = j|X_n = i) > 0\) for some time step \(n\). This means the chain has a “regular” \(m\times m\) transition matrix \(\textbf{P}\): for some power \(\textbf{P}^n\), every entry \(P_{ij}\) will be positive.
- Each state must also be recurrent: once the Markov chain visits a state, it “regenerates” itself and will return to that state at some time in the future with probability 1. Thus the chain visits each state infinitely many times, though the proportion of visits differs from state to state.
- Additionally, the chain must be aperiodic, which means it cannot get stuck in an infinite loop of visiting certain states back and forth.
If we draw a sample from this ergodic chain, it will reach a unique limiting stationary distribution \(\pi = (\pi_1,\dots, \pi_m)\), which describes the long-run chance of being in each state at any given time-step. Since this distribution stays the same over time, we get convergence of the Markov chain: a histogram of the chain at sample size \(n = 10^5\) should match the histogram at \(n = 10^7\), reflecting the proportion of times the chain visits each state as given by the limiting distribution. Using a sample from an ergodic Markov chain is fundamentally like using an i.i.d. sample, and the ergodic theorem gives us the same result as the law of large numbers: \(\frac{1}{N}\sum_{i=1}^Nh(X_i) \approx E_f[h(X)]\), where \(f\) is the limiting distribution.
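A small sketch (an assumed example) makes this concrete: simulate a 3-state ergodic chain and compare the empirical visit proportions of a long sample path against the stationary distribution \(\pi\) solving \(\pi \textbf{P} = \pi\).

P <- matrix(c(.5, .3, .2,
.2, .6, .2,
.3, .3, .4), nrow = 3, byrow = TRUE) #regular transition matrix
n <- 10^5
path <- numeric(n)
path[1] <- 1
for(t in 2:n){
path[t] <- sample(1:3, size = 1, prob = P[path[t-1], ])
}
table(path)/n #empirical visit proportions
v <- eigen(t(P))$vectors[, 1] #left eigenvector of P for eigenvalue 1
Re(v)/sum(Re(v)) #stationary distribution, matching the proportions above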
We can also use a continuous-state Markov chain with an uncountable state space \(\chi\) and a conditional distribution called the transition kernel \(\mathcal{K}(x,y)\), which gives the probability of transitioning from state \(x\) to state \(y\), defined on \(\chi \times \chi\) with the property that \(\mathcal{K}(x,\cdot) >0\) on its support (i.e., all states are accessible). A Monte Carlo method for simulating a distribution \(f\) is to sample from an ergodic Markov chain whose limiting distribution is \(f\). The result is a sample \(X_1, \dots, X_N\) approximately distributed from \(f\), without having to directly simulate from \(f\) itself!
See:
Dobrow: Introduction to Stochastic Processes with R (2016) Chapter 3: Markov Chains for the Long Term
Robert, Casella: Monte Carlo Statistical Methods (2004) Chapter 7.1: The MCMC Principle
A simple random walk is a Markov chain \[X^{(t+1)} |X^{(0)},\dots, X^{(t)} \sim \mathcal{K}(X^{(t)},X^{(t+1)})\] that satisfies \(X^{(t+1)} = X^{(t)} + \epsilon_t\), where \(\epsilon_t\) is independent Gaussian noise, \(\epsilon_t\sim\mathcal{N}(0,1)\). The transition kernel is then \(\mathcal{K}(X^{(t)},\cdot) = \mathcal{N}(X^{(t)},1)\).
It turns out, the simple random walk does not converge, so it cannot be described by a stationary distribution.
t <- 10^5
rho <- 1 #no contraction: this is the simple random walk
markov_chain <- numeric(t)
markov_chain[1] = rnorm(1)
for(i in 2:t){
markov_chain[i] <- rnorm(n = 1, mean = markov_chain[i-1] * rho, sd = 1 )
}
mc_frame <- data.frame(head = markov_chain[1:(10^5/2)],
tail = markov_chain[(10^5/2+1):10^5])
ggplot(data = mc_frame) +
geom_freqpoly(aes(x = head), col = "blue", alpha = .5) +
geom_freqpoly(aes(x = tail), col = "orange", alpha = .5)
Since the head and tail of the random walk have completely different distributions, we can see that it won’t converge.
However, if we make a slight contraction, the random walk does in fact converge.
Consider the Markov chain defined by \(X^{(t+1)} = \rho X^{(t)} + \epsilon_t\), where \(\epsilon_t\sim\mathcal{N}(0,1)\). Using \(\rho = .9\), we will show that the stationary distribution is \(\mathcal{N}\left(0, \frac{1}{1-\rho^2}\right)\).
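One line of algebra confirms the claim: if the chain is stationary with \(X^{(t)} \sim \mathcal{N}(0, \sigma^2)\), then since \(\epsilon_t\) is independent of \(X^{(t)}\), \[\text{Var}(X^{(t+1)}) = \rho^2\sigma^2 + 1 = \sigma^2 \implies \sigma^2 = \frac{1}{1-\rho^2}\]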
t <- 10^5
rho <- .9
markov_chain <- numeric(t)
markov_chain[1] = rnorm(1)
for(i in 2:t){
markov_chain[i] <- rnorm(n = 1, mean = markov_chain[i-1] * rho, sd = 1 )
}
stationary_distribution <- rnorm(n = t, mean = 0, sd = sqrt(1/(1-rho^2)))
mc_frame <- data.frame(mc = markov_chain,
s_dis = stationary_distribution)
ggplot(data = mc_frame) +
geom_freqpoly(aes(x = mc), col = "blue", alpha = .5) +
geom_freqpoly(aes(x = s_dis), col = "orange", alpha = .5)
Notice that the stationary distribution would not exist for the simple random walk: it is undefined for \(\rho = 1\).
See:
Robert, Casella: Introducing Monte Carlo Methods with R (2010) Chapter 6.2: A Peek at Markov Chain Theory
Here we use a random walk to generate a sample from a bivariate normal distribution. Though we already have ways to generate random multivariate normals, this algorithm can be applied to any target density known only up to a normalizing constant, such as an exponential kernel \(e^{-\mathcal{E}(x)}\).
We simulate the noise of the random walk with a bivariate normal (mean 0, standard deviation 3), scaled by a factor of 2. Increasing the noise of the walk can help explore the distribution faster, but may lead to more rejected proposals, slowing down convergence.
library(mvtnorm)
N <- 10^5
chain <- matrix(rep(0,N*2), nrow = N)
#initialize X_0
chain[1,] <- c(2,2)
#acceptance function
f <- function(x){ dmvnorm(x)}
acceptance <- function(y, x){
min(f(y)/f(x) , 1)
}
#generates candidate move
biv_rnorm <- function(){
rnorm(n = 2, sd = 3)
}
#sample the chain
for(t in 1:(N-1)){
X_t <- chain[t,]
Y <- X_t + 2 * biv_rnorm()
alpha <- acceptance(Y, X_t)
u <- runif(n = 1)
if(u < alpha){
chain[t+1,] <- Y
}
else{
chain[t+1,] <- X_t
}
}
hist(chain[,1])
hist(chain[,2])
The histograms show the marginal distributions are standard normal, as expected.
The original Metropolis algorithm was used to determine the ideal configuration of a system of N particles. Say each particle has a location in 2-dimensional space described by \(r\) and \(s\) coordinates, and the state of the system is the coordinates of every particle at a given moment in time. A Markov chain can describe the evolution of the system as it transitions from one particle configuration to the next. Assuming the particles are constrained to a unit square, a state vector looks like \(X = \{(r_1, s_1),\dots, (r_N,s_N)\}\in ([0,1]^2)^N\), distributed according to the Boltzmann distribution \(\pi(x) = \frac{1}{z}e^{-\mathcal{E}(x)}\), where \(\mathcal{E}\) is an energy function of the state. Our goal is to sample from the Boltzmann distribution and find its expected value, so we can determine the ideal configuration of the system. However, \(z\) may be intractable and we have no direct sampling procedure. Faced with this problem, the physicist Metropolis and his collaborators invented one (later generalized by Hastings).

Here’s how it works: we start with a target distribution \(\pi\) we want to sample from, and construct a transition kernel that converges to \(\pi\). We need a symmetric proposal transition matrix \(Q\) that suggests the next state of the Markov chain based on the previous state. \(Q\) can be as simple as a uniform distribution that chooses a nearby configuration at random, as long as it gives an equal transition probability of reversing that move. The actual transition matrix of the Markov chain \(X\) that we construct is given by \(T\), with transition probability \(T(X_{n+1}|X_n) =\) (probability that \(Q\) proposes the transition to \(X_{n+1}=x^*\)) \(\times\) (probability that the transition to \(x^*\) is accepted), where we define \(P(\text{accept } x^{*}|X_n) = \min\{1, \frac{\pi(x^*)}{\pi(x)}\}\). If \(Q\) is symmetric, then \(X\) is time reversible and therefore ergodic with limiting distribution \(\pi\).

Let’s illustrate that \(\pi\) is in fact the limiting distribution of \(T\). Recall that a time-reversible Markov chain is one whose stationary distribution satisfies the property of detailed balance: \(\pi_bT(a|b)=\pi_aT(b|a)\). In our example the proof hinges on the fact that we chose a symmetric proposal matrix \(Q\): \[\pi_bT(a|b) \\ =\pi_bQ(a|b)\min\left\{1, \frac{\pi_a}{\pi_b}\right\}\\ = Q(a|b)\min\{\pi_b, \pi_a\} \\ = Q(b|a)\min\{\pi_b, \pi_a\} \\ = \pi_aQ(b|a)\min\left\{\frac{\pi_b}{\pi_a}, 1\right\} \\ = \pi_aT(b|a)\] Since \(T\) is time reversible, \(\pi\) is the unique limiting distribution of \(T\). This detailed-balance argument is the basis of the Metropolis-Hastings family of algorithms, including the independence sampler used in the next example.
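We can verify detailed balance numerically on a toy example (a sketch with an assumed 3-state target; pi_target and Q are my own choices):

pi_target <- c(.2, .3, .5) #target distribution on 3 states
Q <- matrix(1/3, 3, 3) #symmetric (uniform) proposal
A <- outer(pi_target, pi_target, function(pa, pb) pmin(1, pb/pa)) #acceptance probabilities
T_mat <- Q * A #flow of accepted moves
diag(T_mat) <- 1 - (rowSums(T_mat) - diag(T_mat)) #rejected proposals stay put
max(abs(pi_target * T_mat - t(pi_target * T_mat))) #~0: pi_b T(a|b) = pi_a T(b|a)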
See:
Mathematical Monk on MCMC
Robert, Casella: Introducing Monte Carlo Methods with R (2010) Chapter 6: Metropolis-Hastings Algorithm
Here is a comparison of simulations for a Gamma(4.3, 6.2) density using the following methods:

- a: accept-reject using a Gamma(4, 7) candidate
- b: Metropolis-Hastings using a Gamma(4, 7) candidate
- c: Metropolis-Hastings using a Gamma(5, 6) candidate
g47=rgamma(5000,4,7)
u=runif(5000,max=dgamma(g47,4,7))
x=g47[u<dgamma(g47,4.3,6.2)]
par(mfrow=c(1,3),mar=c(4,4,1,1))
hist(x,freq=FALSE,xlab="",ylab="",col="wheat2",
main="a: Accept-Rej with Ga(4.7) prop")
curve(dgamma(x,4.3,6.2),lwd=2,col="sienna",add=T)
a <- length(x)/5000
X=rep(0,5000)
X[1]=rgamma(1,4.3,6.2)
for (t in 2:5000){
rho=(dgamma(X[t-1],4,7)*dgamma(g47[t],4.3,6.2))/
(dgamma(g47[t],4,7)*dgamma(X[t-1],4.3,6.2))
X[t]=X[t-1]+(g47[t]-X[t-1])*(runif(1)<rho) #move to g47[t] if it is accepted, otherwise stay
}
hist(X,freq=FALSE,xlab="",ylab="",col="wheat2",
main="b: Met-Hast with Ga(4,7) prop")
curve(dgamma(x,4.3,6.2),lwd=2,col="sienna",add=T)
b <- length(unique(X))/5000
g56=rgamma(5000,5,6)
X[1]=rgamma(1,4.3,6.2)
for (t in 2:5000){
rho=(dgamma(X[t-1],5,6)*dgamma(g56[t],4.3,6.2))/
(dgamma(g56[t],5,6)*dgamma(X[t-1],4.3,6.2))
X[t]=X[t-1]+(g56[t]-X[t-1])*(runif(1)<rho)
}
c <- length(unique(X))/5000
hist(X,freq=FALSE,xlab="",ylab="",col="wheat2",
main="c: Met-Hast with Ga(5,6) prop")
curve(dgamma(x,4.3,6.2),lwd=2,col="sienna",add=T)
Method | Efficiency |
---|---|
a | 0.8432 |
b | 0.7934 |
c | 0.781 |
The three methods produce similar results.
See:
Robert, Casella: Introducing Monte Carlo Methods with R (2010) Chapter 6: Metropolis-Hastings Algorithm, Exercise 6.9
The Gibbs sampler is a method to sample from a multivariate distribution by generating a Markov chain. Suppose we have an \(m\)-dimensional random variable \(X = (X_1,\dots,X_m) \sim \pi\). The Gibbs sampler samples from the full conditionals of \(\pi\): \(\pi(X_1|X_2, \dots X_m)\), \(\pi(X_2|X_1, X_3, \dots, X_m)\), \(\dots\), \(\pi(X_m|X_1,\dots ,X_{m-1})\) to generate a Markov chain whose target distribution is \(\pi(X)\). It turns out the Gibbs sampler is a special case of Metropolis-Hastings where the probability of accepting the proposed state is always 1. The proposed state is determined iteratively using the full conditional distributions. If we initialize the first state as \((x^{(0)}_1,\dots, x^{(0)}_m)\), we have \(x^{(1)}_1\sim P(X_1|x^{(0)}_2, \dots, x^{(0)}_m)\). Then we update the current state with \(x_1^{(0)} \rightarrow x_1^{(1)}\) and sample the next dimension \(x^{(1)}_2\sim P(X_2|x_1^{(1)}, x^{(0)}_3, \dots, x^{(0)}_m)\). Keep in mind that the superscript is the time-step and the subscript is the dimension. We iterate \(m\) times to generate the proposal for the next state, which is accepted automatically, and repeat to get a sample. The sample we generate is a Markov chain with limiting distribution \(\pi\), the target distribution. The advantage of the Gibbs sampler is that it can sample from a high-dimensional density using univariate densities, simplifying the problem. Of course, we will need to be able to determine the conditional distributions before we can sample from them.
We will implement the Gibbs sampler to simulate the following density: \[\pi(x, p, n) \propto \binom{n}{x} p^x(1-p)^{n-x}\frac{4^n}{n!}\] where \(x, n \in \mathbb{Z}_{\geq0}\) and \(p\in (0,1)\). To identify the conditional distributions, hold the other two variables constant, and we see that \[X| N=n, P=p \sim \text{binomial}(n, p)\] \[P|X = x, N= n \sim \text{beta}(x+1, n-x+1)\] \[N|P=p, X=x \sim Z + x \text{ where } Z\sim \text{Poisson}(\lambda = 4(1-p))\]
library(gridExtra) #for grid.arrange below
N <- 10^3
sim <- matrix(rep(0,3*N),ncol=3)
#Initialize first state
sim[1, ] <- c(1,.5,2) #x, p, n
#Run Gibb's Sampler
for(i in 2:N){
sim[i,1] <- rbinom(1,sim[i-1,3],sim[i-1,2])
sim[i,2] <- rbeta(1,sim[i,1]+1,sim[i-1,3]-sim[i,1]+1)
sim[i,3] <- rpois(1,4*(1-sim[i,2]))+sim[i,1]
}
#plot marginal distributions
gibbs_sample <- data.frame(x = sim[, 1],
p = sim[, 2],
n = sim[, 3])
marginals <- list(NULL, NULL, NULL)
chart <- function(var){
ggplot(data = gibbs_sample, aes_(x = var)) + geom_histogram(color = "violet")
}
#note: the input to chart is a name, not a string
for(i in 1:3){
marginals[[i]] <- chart(as.name(names(gibbs_sample)[i]))
}
grid.arrange(marginals[[1]], marginals[[2]], marginals[[3]], ncol = 3, top = "Marginal Distributions")
See:
Dobrow: Introduction to Stochastic Processes with R (2016) Chapter 5.3: Gibbs Sampler