Differential Evolution on a correlated posterior, with a Cramer-Rao reference

This vignette is the fourth chapter of the worked-case companion. The target is a strongly correlated bivariate posterior whose covariance is known in closed form, so the geometry the sampler recovers can be checked against the Cramer-Rao bound, the inverse Fisher information. The call is gpu_metropolis(method = "de") of v0.4.0.

The model

Let \(y_1, \ldots, y_n\) be drawn from a bivariate normal with unknown mean \(\mu = (\mu_1, \mu_2)\) and a known covariance \(\Sigma\) with correlation \(\rho = 0.9\). With a flat prior on \(\mu\), the posterior is

\[ \mu \mid y \sim \mathcal{N}_2\!\left(\bar y,\; \Sigma / n\right), \]

a bivariate normal carrying the same correlation \(0.9\). A diagonal random-walk proposal cannot represent that correlation in a single move; the difference proposal of Differential Evolution can.

library(gpumetropolis)
set.seed(1)
rho   <- 0.9
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
n     <- 500
L     <- chol(Sigma)
y     <- matrix(stats::rnorm(n * 2), n, 2) %*% L
y[, 1] <- y[, 1] + 2      # truth: mu1 = 2
y[, 2] <- y[, 2] - 1      # truth: mu2 = -1
ybar     <- colMeans(y)
post_cov <- Sigma / n     # closed-form posterior covariance

The per-observation Gaussian log-likelihood uses \(\Sigma^{-1}\) as constants. When a coefficient of the formula is a number computed in R, the supported way to put it into the DSL is to build the formula text with sprintf() and turn it into a formula with as.formula(), as below; the entries of \(\Sigma^{-1}\) enter as literal constants the kernel reads.

Si <- solve(Sigma)
loglik <- stats::as.formula(sprintf(
  "~ -0.5 * (%.6f*(y1-mu1)^2 + 2*%.6f*(y1-mu1)*(y2-mu2) + %.6f*(y2-mu2)^2)",
  Si[1, 1], Si[1, 2], Si[2, 2]))
model <- gpum_model(loglik, params = c("mu1", "mu2"), data = c("y1", "y2"))

Random walk against Differential Evolution

Both methods sample this unimodal posterior correctly; the difference is efficiency. The diagonal random walk pays for ignoring the correlation with a lower effective sample size per iteration.

dat <- list(y1 = y[, 1], y2 = y[, 2])
fit_rwm <- gpu_metropolis(model, data = dat, n_iter = 8000, n_chains = 16,
                          method = "rwm", seed = 1, backend = "cpu")
fit_de  <- gpu_metropolis(model, data = dat, n_iter = 8000, n_chains = 16,
                          method = "de", seed = 1, backend = "cpu")

eff <- function(fit) c(
  ess_mu1 = ess(fit$draws[, , "mu1"], 0),
  ess_mu2 = ess(fit$draws[, , "mu2"], 0),
  corr    = stats::cor(as.vector(fit$draws[, , "mu1"]),
                       as.vector(fit$draws[, , "mu2"]))
)
rbind(rwm = eff(fit_rwm), de = eff(fit_de))
#>      ess_mu1  ess_mu2      corr
#> rwm 2223.738 2168.689 0.8992094
#> de  8387.983 8606.421 0.9008327

The recovered correlation sits near the target \(0.9\) for both, and the Differential Evolution run reaches a higher effective sample size on the correlated geometry.

The Cramer-Rao reference

For \(n\) independent draws from a normal with known \(\Sigma\), the Fisher information about the mean is \(I(\mu) = n\,\Sigma^{-1}\), so the Cramer-Rao lower bound on the covariance of an unbiased estimator is

\[ I(\mu)^{-1} = \Sigma / n , \]

which is exactly the posterior covariance here. This model is regular and the posterior is Gaussian, so the Bernstein-von Mises identification of the posterior covariance with the inverse Fisher information is not an approximation but an equality; the comparison is a clean check that the sampler recovers the information-bound geometry rather than a frequentist claim layered onto a Bayesian fit. The estimator-theory reading is that the posterior mean here is consistent and asymptotically efficient: it concentrates on the true mean and attains the Cramer-Rao bound, so the bound is the natural yardstick for its spread.

The package computes the bound directly with gpum_crlb(), which forms the observed Fisher information by a central-difference Hessian of the same compiled log-likelihood the sampler used, then inverts it. It returns the bound beside the empirical posterior spread.

cr <- gpum_crlb(fit_de, data = dat)
cr
#> <gpum_crlb>
#>  parameter posterior_sd cramer_rao_sd  ratio
#>        mu1      0.04447       0.04472 0.9944
#>        mu2      0.04468       0.04472 0.9992
#> 
#>   posterior correlation 0.901, Cramer-Rao 0.900
#> 
#>   Frequentist asymptotic reference (inverse Fisher information), prior-free. A weak/flat prior is assumed; an informative prior makes the posterior tighter.

The reported bound matches the analytic inverse Fisher information solve(n * Sigma^{-1}) = Sigma / n for this model, and the posterior spread from the Differential Evolution draws tracks it to within Monte Carlo error, correlation included.

analytic_crlb <- solve(n * Si)              # = Sigma / n, in closed form
round(cbind(gpum_crlb = cr$crlb_sd,
            analytic  = sqrt(diag(analytic_crlb))), 4)
#>      gpum_crlb analytic
#> [1,]    0.0447   0.0447
#> [2,]    0.0447   0.0447

The package draws the comparison in one call. gpum_pairs(fit, crlb = cr) shows the marginal posteriors with their highest density intervals on the diagonal and, off the diagonal, the bivariate draws with their credible-region contour in black and the Cramer-Rao ellipse in red. Reading it: the black empirical region and the red theoretical ellipse coincide when the recovered spread matches the information bound, which is the case here.

gpum_pairs(fit_de, crlb = cr, level = 0.95)

Pairs plot of the correlated posterior with the Cramer-Rao ellipse overlaid by gpum_pairs.

The plot() method of the Cramer-Rao object gives the same verdict as a bar chart, the posterior standard deviation against the bound, parameter by parameter.

plot(cr)

Posterior standard deviation against the Cramer-Rao bound, by parameter.

Consistency and the n to the minus one-half rate

The Cramer-Rao comparison at a single \(n\) is one slice of a sharper statement: as the data grow, the posterior mean concentrates on the truth (consistency) and its spread shrinks at the rate \(n^{-1/2}\) set by the bound (asymptotic efficiency). The appendix below makes both visible by refitting the same model at a sequence of sample sizes and reading off the error of the posterior mean against the truth and the posterior standard deviation against the closed-form \(\sqrt{1/n}\) reference. This is a simulation check, possible only because the truth is known here; it is a validation of the theory, not a per-fit diagnostic.

ns <- c(125, 500, 2000)
rows <- lapply(ns, function(nn) {
  yy <- matrix(stats::rnorm(nn * 2), nn, 2) %*% L
  yy[, 1] <- yy[, 1] + 2
  yy[, 2] <- yy[, 2] - 1
  f <- gpu_metropolis(model, data = list(y1 = yy[, 1], y2 = yy[, 2]),
                      n_iter = 4000, n_chains = 12, method = "de",
                      seed = 1, backend = "cpu")
  m1 <- as.vector(f$draws[, , "mu1"])
  data.frame(n = nn,
             mean_error = abs(mean(m1) - 2),   # distance of the estimate to truth
             post_sd = stats::sd(m1),
             crlb_sd = sqrt(1 / nn))           # Sigma_11 / n, here Sigma_11 = 1
})
tab <- do.call(rbind, rows)
tab$sd_ratio <- tab$post_sd / tab$crlb_sd
tab
#>      n  mean_error    post_sd    crlb_sd  sd_ratio
#> 1  125 0.031489967 0.08889299 0.08944272 0.9938538
#> 2  500 0.009717197 0.04525701 0.04472136 1.0119774
#> 3 2000 0.019773891 0.02247282 0.02236068 1.0050150

The error of the posterior mean stays within the posterior standard deviation at every size and shrinks with it as the sample grows, the signature of consistency; the single-replication error is itself noisy, so it is the trend and the scale, not a strict ordering, that carry the point. The posterior standard deviation tracks the \(\sqrt{1/n}\) reference with a ratio near one at every size, the signature of asymptotic efficiency: it roughly halves each time \(n\) quadruples, as the printed post_sd column against crlb_sd shows. Consistency is the more robust of the two: it can hold on identifiable models where the normal approximation behind the efficiency rate does not, which is why gpum_crlb() reports the bound only where the stronger conditions also hold.

Honest positioning

The agreement above is what theory predicts for a regular Gaussian model; it is a validation of the sampler, not a discovery. The reference is worth stating precisely because it does not transfer to the harder targets the package also serves: the bimodal M2 of the parallel-tempering chapter and any copula parameter near the independence boundary violate the regularity the Cramer-Rao bound assumes, and the posterior there is not asymptotically normal. The lesson of this chapter is narrow and exact: on a regular correlated target, method = "de" recovers the full covariance, correlation included, that the inverse Fisher information prescribes.