This vignette is the third chapter
of the worked-case companion to the package: one model at a time, the
question the user faces, the failure of the default sampler, and the
call that fixes it. The model here is the twisted Gaussian of Haario,
Saksman and Tamminen (2001), the standard stress test for samplers on a
curved target, and the call that fixes it is
gpu_metropolis(method = "de"), the v0.4.0 release.
The banana is a Gaussian bent along a parabola. In two dimensions its log-density, up to a constant, is
\[ \log p(x_1, x_2) = -\frac{x_1^2}{200} - \frac12\bigl(x_2 + b\,x_1^2 - 100\,b\bigr)^2 , \]
with \(b = 0.1\). The first
coordinate is wide, with standard deviation near 10; the second is
pinned to the parabola \(x_2 \approx 100b -
b\,x_1^2\), so the local correlation between \(x_1\) and \(x_2\) changes sign and strength as \(x_1\) moves along the ridge. No single
diagonal proposal can match a geometry whose correlation is different at
every point of the ridge. The target has no data; it is declared through
the prior. This is the general way to sample an arbitrary target with no
observed data: set loglik = ~ 0, so the log-likelihood
contributes nothing, and put the whole target in prior. The
sampler then draws from the prior, which here is the banana.
The default method = "rwm" adapts a per-chain,
per-dimension proposal scale during warmup. That is the best a diagonal
proposal can do, and it is not enough here: a scale tuned for the wide
\(x_1\) axis overshoots the narrow
conditional spread of \(x_2\), and a
scale tuned for \(x_2\) crawls along
\(x_1\). The chain cannot follow the
bend.
method = "de" builds each proposal from the difference
of two other chains in the ensemble. Where the ensemble straddles the
ridge, those difference vectors point along the ridge, so the proposal
is automatically oriented to the local geometry, with no covariance to
estimate and no scale to tune.
fit_de <- gpu_metropolis(banana, init = init, n_iter = 8000,
method = "de", seed = 1, backend = "cpu")
rhat_de <- c(rhat(fit_de$draws[, , "x1"], 0),
rhat(fit_de$draws[, , "x2"], 0))
ess_de <- c(ess(fit_de$draws[, , "x1"], 0),
ess(fit_de$draws[, , "x2"], 0))The two samplers run the same chains for the same number of iterations from the same start. The contrast is in the mixing.
data.frame(
method = c("rwm", "de"),
rhat_x1 = c(rhat_rwm[1], rhat_de[1]),
rhat_x2 = c(rhat_rwm[2], rhat_de[2]),
ess_x1 = c(ess_rwm[1], ess_de[1]),
ess_x2 = c(ess_rwm[2], ess_de[2])
)
#> method rhat_x1 rhat_x2 ess_x1 ess_x2
#> 1 rwm 1.088992 1.090526 1141.024 1319.172
#> 2 de 1.034410 1.037232 1942.358 2211.578op <- graphics::par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
xr <- range(c(fit_rwm$draws[, , "x1"], fit_de$draws[, , "x1"]))
yr <- range(c(fit_rwm$draws[, , "x2"], fit_de$draws[, , "x2"]))
plot(as.vector(fit_rwm$draws[, , "x1"]), as.vector(fit_rwm$draws[, , "x2"]),
pch = ".", xlim = xr, ylim = yr, xlab = "x1", ylab = "x2",
main = sprintf("rwm (R-hat x1 %.2f)", rhat_rwm[1]))
plot(as.vector(fit_de$draws[, , "x1"]), as.vector(fit_de$draws[, , "x2"]),
pch = ".", xlim = xr, ylim = yr, xlab = "x1", ylab = "x2",
main = sprintf("de (R-hat x1 %.2f)", rhat_de[1]))The Differential Evolution panel fills the curved ridge more evenly and reaches a markedly lower R-hat at the same iteration budget, with a higher effective sample size; the random-walk panel covers the ridge unevenly, with the higher R-hat that is the signature of chains still mixing across the bend. The banana is a hard target and neither run is fully converged at 8000 iterations, so both gain from a longer run; the point of the comparison is the gap between them at equal cost. The one-call diagnostic prints the per-parameter table with the convergence verdict and opens a panel per parameter; for a Differential Evolution fit it adds a row with the per-chain acceptance and the ensemble spread per dimension over the batches, the quantity that keeps the difference proposal alive. Reading that row: the spread should stay well above zero, a spread collapsing toward zero in a dimension is the warning that the difference vectors have degenerated there.
gpum_diagnose(fit_de)
#> <gpum_diagnose: Inconclusive>
#> method de, gamma 1.190, backend cpu, chains 40, iterations 4000 (warmup 4000, adaptive)
#>
#> parameter mean sd q2.5 q50 q97.5 Rhat ESS MCSE
#> x1 0.3690 8.5186 -16.4449 0.5142 16.4756 1.0344 1942.3580 0.1933
#> x2 2.7471 9.3669 -23.1377 6.3237 11.1206 1.0372 2211.5780 0.1992
#>
#> Increase n_iter to raise the effective sample size.The banana is a textbook success of population samplers, not a
novelty: any correct Differential Evolution implementation recovers the
ridge, and a gradient sampler such as the no-U-turn sampler would do as
well or better on a target this smooth. The value
gpumetropolis adds is that the same formula a user already
writes becomes a kernel that runs on the CPU and the GPU from one
source, and that method = "de" is a built-in path rather
than a hand-coded proposal. The next chapter turns to a correlated
posterior with a closed-form Cramer-Rao reference, where the correctness
of the recovered geometry can be checked against theory.