---
title: "Parallel tempering for bimodal posteriors (M2)"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Parallel tempering for bimodal posteriors (M2)}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      fig.width = 7, fig.height = 4.2)
Sys.setenv(RAYON_NUM_THREADS = "2")
```

This vignette is the first chapter of a worked-case companion to the package:
one model at a time, the question the user faces, the failure of the default
sampler, and the call that fixes it. The model here is M2 of the registered
benchmark of `gpumetropolis`, a separated bimodal posterior, and the call
that fixes it is `gpu_metropolis(method = "pt")`, the v0.3.0 release.

## The model

Observations $y_1, \ldots, y_N$ are drawn from $\mathcal{N}(|\mu|,\, \sigma)$
with $\sigma$ known. The likelihood depends on $\mu$ only through $|\mu|$, so
the posterior of $\mu$ is symmetric and bimodal with modes near $+c$ and $-c$
where $c$ is positive and close to $\mathrm{mean}(|y|)$. A single random-walk
chain settles in one basin; crossing the low-density region around $\mu = 0$
is exponentially rare once the modes separate.

```{r data}
library(gpumetropolis)
set.seed(11)
y <- rnorm(400, mean = 3, sd = 1)     # truth: mu = +3
model <- gpum_model(
  loglik = ~ log(exp(-((y - mu)^2) / 2) + exp(-((y + mu)^2) / 2)),
  params = "mu", data = "y"
)
init <- matrix(seq(-6, 6, length.out = 8), nrow = 8, ncol = 1L)
```

## The failure mode of plain random-walk Metropolis

The default `method = "rwm"` with eight overdispersed chains looks at first
like it covers both modes, because the **pooled** draws across chains include
both $+3$ and $-3$:

```{r rwm-fit}
fit_rwm <- gpu_metropolis(model, data = list(y = y), init = init,
                          proposal_sd = 0.15, n_iter = 4000,
                          seed = 1, backend = "cpu")
fit_rwm
```

The acceptance rate is healthy and the posterior mean is near zero, the
average of the two modes. The trap is that **each chain is stuck in the basin
near its starting point**: a chain seeded at $-6$ converges to $-3$ and stays
there; a chain seeded at $+6$ converges to $+3$ and stays there. The
mode-crossing event has probability proportional to
$\exp(-(\text{barrier height}))$ per step, vanishing for the M2 separation.

The diagnostic that catches this is the split R-hat, not the pooled KS
test against the reference. The pooled KS statistic averages the two
within-mode CDFs and looks correct; R-hat compares within-chain and
between-chain variances and reports the mismatch.

```{r rwm-rhat}
rhat(fit_rwm$draws[, , "mu"], warmup = 0)
```

R-hat far above one is the honest verdict on this run: the chains did not
mix.

## Parallel tempering

`method = "pt"` adds an auxiliary set of chains at higher temperatures on the
same target. The hot chains accept moves more easily because their tempered
acceptance ratio is $\exp\bigl((\log\pi(y') - \log\pi(y)) / T_c\bigr)$ with
$T_c > 1$; the cold chain, at $T = 1$, samples the actual posterior. Between
batches an adjacent-pair swap step proposes exchanges of states, accepted
with the ratio
$\exp\bigl((\log\pi(x_{c+1}) - \log\pi(x_c)) \cdot (1/T_c - 1/T_{c+1})\bigr)$.
The cold chain inherits mode-crossings from the hot ones through accepted
swaps; the hot chains keep mixing because their tempered target is
substantially flatter than the cold one.

```{r pt-fit}
fit_pt <- gpu_metropolis(model, data = list(y = y), init = init,
                         proposal_sd = 0.15, n_iter = 4000,
                         method = "pt", seed = 1, backend = "cpu")
fit_pt
```

The print method labels the run as parallel tempering, shows the band of
swap acceptances across pairs, and reports the posterior summary of the cold
chain alone. R-hat on the cold chain is near one:

```{r pt-rhat}
rhat(matrix(fit_pt$draws[, 1L, "mu"], ncol = 1L), warmup = 0)
```

The full diagnostic is one call away:

```{r pt-diagnose, fig.width = 9, fig.height = 7}
gpum_diagnose(fit_pt)
```

The verdict bar at the top of the table reads parallel tempering, the
diagnostic table summarises the cold chain only, and the extra row of plots
shows the swap acceptance per adjacent pair across the warmup batches with
the asymptotic optimum of $0.234$ drawn as a reference. When the M2
posterior is fully symmetric the swap acceptance sits well above the
optimum, which is harmless: a symmetric target makes most swaps trivial.

## Inspecting the cold chain directly

Column 1 of the draws array is the cold chain, the chain at temperature
$T = 1$, and it is the only chain that targets the actual posterior; the other
columns are auxiliary hot chains that feed it through swaps and are not for
inference. So the post-warmup posterior is the slice `fit_pt$draws[, 1L, ]`. A
histogram makes the two modes plain:

```{r pt-hist}
cold <- fit_pt$draws[, 1L, "mu"]
hist(cold, breaks = 60, freq = FALSE,
     main = "Posterior of mu under parallel tempering (cold chain)",
     xlab = "mu", col = "grey85", border = "white")
abline(v = c(-3, 3), col = "red", lty = 2)
```

A bimodal posterior is also where the credible interval `hdi()` must be read
with care. A single 95% interval here spans the gap between the two modes,
including the low-density region around zero that the chain rarely visits, so
the interval is wide and not a summary of either mode. The lesson is general:
`hdi()` is a one-interval summary and is only faithful when the posterior is
unimodal.

```{r pt-hdi}
hdi(cold, ci = 0.95)         # spans both modes: not a summary of either one
```

A trace shows that the cold chain repeatedly crosses zero, a clear sign of
true mode mixing:

```{r pt-trace}
matplot(fit_pt$draws[, , "mu"], type = "l", lty = 1,
        xlab = "iteration (post-warmup)", ylab = "mu",
        main = "All chains; cold chain in black")
```

## Tuning notes for parallel tempering

When the default ladder is too dense the swap acceptance is high and the
cold chain still mixes well, just with no efficiency lost. When the ladder
is too sparse the swap acceptance drops; the `gpum_diagnose` plot fires a
hint when any adjacent pair averages below $0.10$.

The two practical knobs are `temperatures` and `swap_every`. A wider ladder
reduces the autocorrelation of the cold chain at the cost of additional
parallel-chain effort:

```{r pt-wider, eval = FALSE}
fit_pt_wide <- gpu_metropolis(
  model, data = list(y = y), init = init,
  proposal_sd = 0.15, n_iter = 4000,
  method = "pt",
  temperatures = c(1, 2, 4, 8, 16, 32, 64, 128),
  swap_every = 5,
  seed = 1, backend = "cpu"
)
```

A smaller `swap_every` increases the rate at which the cold chain inherits
new states from the hot chains, often the most cost-effective adjustment on
symmetric multimodal targets.

## What the focused re-run shows in numbers

The accompanying `benchmark/run_m2_pt.R` runs the same cell as above with
twenty replications and compares three adapters: `gpumetropolis-cpu`,
`gpumetropolis-cpu-pt` and `nimble`. Averaged over the twenty replications,
parallel tempering reaches R-hat near $1.00$ where the random-walk
gpumetropolis and `nimble` both stay at R-hat near $62$. The trade is a
$1.7$x wall-clock factor for the host-side swap step and a lower nominal
effective sample size, since each cold-chain draw now carries the
autocorrelation of a true mode-crossing chain rather than the
autocorrelation of a stuck one. The full numbers are recorded as the v0.9
amendment of the experiment protocol.

## When to reach for parallel tempering

The honest rule of thumb: random-walk Metropolis on a unimodal target with
moderate posterior contrast is well served by the default `method = "rwm"`
with adaptive warmup. Parallel tempering pays off when:

1. The posterior is multimodal and the modes are separated by low-density
   regions that random walk cannot cross within the iteration budget.
2. The user can spare a small number of hot chains; the package supports up
   to dozens with no extra coordination.
3. The cost of a $1.5$ to $2$x wall-clock factor is acceptable relative to
   the value of a converged R-hat.

When none of those hold, `rwm` is the right default. The package leaves the
choice with the user through the `method` argument.