--- title: "Differential Evolution on a curved target (the banana)" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Differential Evolution on a curved target (the banana)} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 4.2) Sys.setenv(RAYON_NUM_THREADS = "2") ``` This vignette is the third chapter of the worked-case companion to the package: one model at a time, the question the user faces, the failure of the default sampler, and the call that fixes it. The model here is the twisted Gaussian of Haario, Saksman and Tamminen (2001), the standard stress test for samplers on a curved target, and the call that fixes it is `gpu_metropolis(method = "de")`, the v0.4.0 release. ## The model The banana is a Gaussian bent along a parabola. In two dimensions its log-density, up to a constant, is $$ \log p(x_1, x_2) = -\frac{x_1^2}{200} - \frac12\bigl(x_2 + b\,x_1^2 - 100\,b\bigr)^2 , $$ with $b = 0.1$. The first coordinate is wide, with standard deviation near 10; the second is pinned to the parabola $x_2 \approx 100b - b\,x_1^2$, so the local correlation between $x_1$ and $x_2$ changes sign and strength as $x_1$ moves along the ridge. No single diagonal proposal can match a geometry whose correlation is different at every point of the ridge. The target has no data; it is declared through the prior. This is the general way to sample an arbitrary target with no observed data: set `loglik = ~ 0`, so the log-likelihood contributes nothing, and put the whole target in `prior`. The sampler then draws from the prior, which here is the banana. ```{r model} library(gpumetropolis) banana <- gpum_model( loglik = ~ 0, params = c("x1", "x2"), prior = ~ -x1^2 / 200 - 0.5 * (x2 + 0.1 * x1^2 - 10)^2 ) # Overdispersed starting ensemble across the ridge. set.seed(11) init <- cbind(stats::rnorm(40, 0, 8), stats::rnorm(40, 0, 4)) ``` ## The failure mode of diagonal random-walk Metropolis The default `method = "rwm"` adapts a per-chain, per-dimension proposal scale during warmup. That is the best a diagonal proposal can do, and it is not enough here: a scale tuned for the wide $x_1$ axis overshoots the narrow conditional spread of $x_2$, and a scale tuned for $x_2$ crawls along $x_1$. The chain cannot follow the bend. ```{r rwm} fit_rwm <- gpu_metropolis(banana, init = init, n_iter = 8000, method = "rwm", seed = 1, backend = "cpu") rhat_rwm <- c(rhat(fit_rwm$draws[, , "x1"], 0), rhat(fit_rwm$draws[, , "x2"], 0)) ess_rwm <- c(ess(fit_rwm$draws[, , "x1"], 0), ess(fit_rwm$draws[, , "x2"], 0)) ``` ## The fix: Differential Evolution `method = "de"` builds each proposal from the difference of two other chains in the ensemble. Where the ensemble straddles the ridge, those difference vectors point along the ridge, so the proposal is automatically oriented to the local geometry, with no covariance to estimate and no scale to tune. ```{r de} fit_de <- gpu_metropolis(banana, init = init, n_iter = 8000, method = "de", seed = 1, backend = "cpu") rhat_de <- c(rhat(fit_de$draws[, , "x1"], 0), rhat(fit_de$draws[, , "x2"], 0)) ess_de <- c(ess(fit_de$draws[, , "x1"], 0), ess(fit_de$draws[, , "x2"], 0)) ``` The two samplers run the same chains for the same number of iterations from the same start. The contrast is in the mixing. ```{r compare} data.frame( method = c("rwm", "de"), rhat_x1 = c(rhat_rwm[1], rhat_de[1]), rhat_x2 = c(rhat_rwm[2], rhat_de[2]), ess_x1 = c(ess_rwm[1], ess_de[1]), ess_x2 = c(ess_rwm[2], ess_de[2]) ) ``` ## Reading the result ```{r fig, fig.alt = "Pooled draws of the banana target for random-walk Metropolis and Differential Evolution."} op <- graphics::par(mfrow = c(1, 2), mar = c(4, 4, 2, 1)) xr <- range(c(fit_rwm$draws[, , "x1"], fit_de$draws[, , "x1"])) yr <- range(c(fit_rwm$draws[, , "x2"], fit_de$draws[, , "x2"])) plot(as.vector(fit_rwm$draws[, , "x1"]), as.vector(fit_rwm$draws[, , "x2"]), pch = ".", xlim = xr, ylim = yr, xlab = "x1", ylab = "x2", main = sprintf("rwm (R-hat x1 %.2f)", rhat_rwm[1])) plot(as.vector(fit_de$draws[, , "x1"]), as.vector(fit_de$draws[, , "x2"]), pch = ".", xlim = xr, ylim = yr, xlab = "x1", ylab = "x2", main = sprintf("de (R-hat x1 %.2f)", rhat_de[1])) graphics::par(op) ``` The Differential Evolution panel fills the curved ridge more evenly and reaches a markedly lower R-hat at the same iteration budget, with a higher effective sample size; the random-walk panel covers the ridge unevenly, with the higher R-hat that is the signature of chains still mixing across the bend. The banana is a hard target and neither run is fully converged at 8000 iterations, so both gain from a longer run; the point of the comparison is the gap between them at equal cost. The one-call diagnostic prints the per-parameter table with the convergence verdict and opens a panel per parameter; for a Differential Evolution fit it adds a row with the per-chain acceptance and the ensemble spread per dimension over the batches, the quantity that keeps the difference proposal alive. Reading that row: the spread should stay well above zero, a spread collapsing toward zero in a dimension is the warning that the difference vectors have degenerated there. ```{r diagnose, fig.alt = "Differential Evolution diagnostic panels for the banana fit.", fig.width = 7, fig.height = 6} gpum_diagnose(fit_de) ``` ## Honest positioning The banana is a textbook success of population samplers, not a novelty: any correct Differential Evolution implementation recovers the ridge, and a gradient sampler such as the no-U-turn sampler would do as well or better on a target this smooth. The value `gpumetropolis` adds is that the same formula a user already writes becomes a kernel that runs on the CPU and the GPU from one source, and that `method = "de"` is a built-in path rather than a hand-coded proposal. The next chapter turns to a correlated posterior with a closed-form Cramer-Rao reference, where the correctness of the recovered geometry can be checked against theory.