---
title: "Differential Evolution on a correlated posterior, with a Cramer-Rao reference"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Differential Evolution on a correlated posterior, with a Cramer-Rao reference}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      fig.width = 7, fig.height = 4.2)
Sys.setenv(RAYON_NUM_THREADS = "2")
```

This vignette is the fourth chapter of the worked-case companion. The target
is a strongly correlated bivariate posterior whose covariance is known in
closed form, so the geometry the sampler recovers can be checked against the
Cramer-Rao bound, the inverse Fisher information. The call is
`gpu_metropolis(method = "de")` of v0.4.0.

## The model

Let $y_1, \ldots, y_n$ be drawn from a bivariate normal with unknown mean
$\mu = (\mu_1, \mu_2)$ and a known covariance $\Sigma$ with correlation
$\rho = 0.9$. With a flat prior on $\mu$, the posterior is

$$
\mu \mid y \sim \mathcal{N}_2\!\left(\bar y,\; \Sigma / n\right),
$$

a bivariate normal carrying the same correlation $0.9$. A diagonal
random-walk proposal cannot represent that correlation in a single move; the
difference proposal of Differential Evolution can.

```{r data}
library(gpumetropolis)
set.seed(1)
rho   <- 0.9
Sigma <- matrix(c(1, rho, rho, 1), 2, 2)
n     <- 500
L     <- chol(Sigma)
y     <- matrix(stats::rnorm(n * 2), n, 2) %*% L
y[, 1] <- y[, 1] + 2      # truth: mu1 = 2
y[, 2] <- y[, 2] - 1      # truth: mu2 = -1
ybar     <- colMeans(y)
post_cov <- Sigma / n     # closed-form posterior covariance
```

The per-observation Gaussian log-likelihood uses $\Sigma^{-1}$ as constants.
When a coefficient of the formula is a number computed in R, the supported way
to put it into the DSL is to build the formula text with `sprintf()` and turn
it into a formula with `as.formula()`, as below; the entries of $\Sigma^{-1}$
enter as literal constants the kernel reads.

```{r model}
Si <- solve(Sigma)
loglik <- stats::as.formula(sprintf(
  "~ -0.5 * (%.6f*(y1-mu1)^2 + 2*%.6f*(y1-mu1)*(y2-mu2) + %.6f*(y2-mu2)^2)",
  Si[1, 1], Si[1, 2], Si[2, 2]))
model <- gpum_model(loglik, params = c("mu1", "mu2"), data = c("y1", "y2"))
```

## Random walk against Differential Evolution

Both methods sample this unimodal posterior correctly; the difference is
efficiency. The diagonal random walk pays for ignoring the correlation with a
lower effective sample size per iteration.

```{r fits}
dat <- list(y1 = y[, 1], y2 = y[, 2])
fit_rwm <- gpu_metropolis(model, data = dat, n_iter = 8000, n_chains = 16,
                          method = "rwm", seed = 1, backend = "cpu")
fit_de  <- gpu_metropolis(model, data = dat, n_iter = 8000, n_chains = 16,
                          method = "de", seed = 1, backend = "cpu")

eff <- function(fit) c(
  ess_mu1 = ess(fit$draws[, , "mu1"], 0),
  ess_mu2 = ess(fit$draws[, , "mu2"], 0),
  corr    = stats::cor(as.vector(fit$draws[, , "mu1"]),
                       as.vector(fit$draws[, , "mu2"]))
)
rbind(rwm = eff(fit_rwm), de = eff(fit_de))
```

The recovered correlation sits near the target $0.9$ for both, and the
Differential Evolution run reaches a higher effective sample size on the
correlated geometry.

## The Cramer-Rao reference

For $n$ independent draws from a normal with known $\Sigma$, the Fisher
information about the mean is $I(\mu) = n\,\Sigma^{-1}$, so the Cramer-Rao
lower bound on the covariance of an unbiased estimator is

$$
I(\mu)^{-1} = \Sigma / n ,
$$

which is exactly the posterior covariance here. This model is regular and
the posterior is Gaussian, so the Bernstein-von Mises identification of the
posterior covariance with the inverse Fisher information is not an
approximation but an equality; the comparison is a clean check that the
sampler recovers the information-bound geometry rather than a frequentist
claim layered onto a Bayesian fit. The estimator-theory reading is that the
posterior mean here is consistent and asymptotically efficient: it concentrates
on the true mean and attains the Cramer-Rao bound, so the bound is the natural
yardstick for its spread.

The package computes the bound directly with `gpum_crlb()`, which forms the
observed Fisher information by a central-difference Hessian of the same
compiled log-likelihood the sampler used, then inverts it. It returns the
bound beside the empirical posterior spread.

```{r crlb}
cr <- gpum_crlb(fit_de, data = dat)
cr
```

The reported bound matches the analytic inverse Fisher information
`solve(n * Sigma^{-1}) = Sigma / n` for this model, and the posterior spread
from the Differential Evolution draws tracks it to within Monte Carlo error,
correlation included.

```{r crlb-check}
analytic_crlb <- solve(n * Si)              # = Sigma / n, in closed form
round(cbind(gpum_crlb = cr$crlb_sd,
            analytic  = sqrt(diag(analytic_crlb))), 4)
```

The package draws the comparison in one call. `gpum_pairs(fit, crlb = cr)`
shows the marginal posteriors with their highest density intervals on the
diagonal and, off the diagonal, the bivariate draws with their credible-region
contour in black and the Cramer-Rao ellipse in red. Reading it: the black
empirical region and the red theoretical ellipse coincide when the recovered
spread matches the information bound, which is the case here.

```{r fig, fig.alt = "Pairs plot of the correlated posterior with the Cramer-Rao ellipse overlaid by gpum_pairs.", fig.width = 6.5, fig.height = 6}
gpum_pairs(fit_de, crlb = cr, level = 0.95)
```

The `plot()` method of the Cramer-Rao object gives the same verdict as a bar
chart, the posterior standard deviation against the bound, parameter by
parameter.

```{r crlb-bars, fig.alt = "Posterior standard deviation against the Cramer-Rao bound, by parameter.", fig.width = 6, fig.height = 4}
plot(cr)
```

## Consistency and the n to the minus one-half rate

The Cramer-Rao comparison at a single $n$ is one slice of a sharper statement:
as the data grow, the posterior mean concentrates on the truth (consistency)
and its spread shrinks at the rate $n^{-1/2}$ set by the bound (asymptotic
efficiency). The appendix below makes both visible by refitting the same model
at a sequence of sample sizes and reading off the error of the posterior mean
against the truth and the posterior standard deviation against the closed-form
$\sqrt{1/n}$ reference. This is a simulation check, possible only because the
truth is known here; it is a validation of the theory, not a per-fit
diagnostic.

```{r consistency}
ns <- c(125, 500, 2000)
rows <- lapply(ns, function(nn) {
  yy <- matrix(stats::rnorm(nn * 2), nn, 2) %*% L
  yy[, 1] <- yy[, 1] + 2
  yy[, 2] <- yy[, 2] - 1
  f <- gpu_metropolis(model, data = list(y1 = yy[, 1], y2 = yy[, 2]),
                      n_iter = 4000, n_chains = 12, method = "de",
                      seed = 1, backend = "cpu")
  m1 <- as.vector(f$draws[, , "mu1"])
  data.frame(n = nn,
             mean_error = abs(mean(m1) - 2),   # distance of the estimate to truth
             post_sd = stats::sd(m1),
             crlb_sd = sqrt(1 / nn))           # Sigma_11 / n, here Sigma_11 = 1
})
tab <- do.call(rbind, rows)
tab$sd_ratio <- tab$post_sd / tab$crlb_sd
tab
```

The error of the posterior mean stays within the posterior standard deviation
at every size and shrinks with it as the sample grows, the signature of
consistency; the single-replication error is itself noisy, so it is the trend
and the scale, not a strict ordering, that carry the point. The posterior
standard deviation tracks the $\sqrt{1/n}$ reference with a ratio near one at
every size, the signature of asymptotic efficiency: it roughly halves each
time $n$ quadruples, as the printed `post_sd` column against `crlb_sd` shows. Consistency is the more robust of the two: it can hold on
identifiable models where the normal approximation behind the efficiency rate
does not, which is why `gpum_crlb()` reports the bound only where the stronger
conditions also hold.

## Honest positioning

The agreement above is what theory predicts for a regular Gaussian model; it
is a validation of the sampler, not a discovery. The reference is worth
stating precisely because it does not transfer to the harder targets the
package also serves: the bimodal M2 of the parallel-tempering chapter and any
copula parameter near the independence boundary violate the regularity the
Cramer-Rao bound assumes, and the posterior there is not asymptotically
normal. The lesson of this chapter is narrow and exact: on a regular
correlated target, `method = "de"` recovers the full covariance, correlation
included, that the inverse Fisher information prescribes.