This chapter is a complete study
on a real data set, exercising every part of the package on one classic
problem. The data are the 272 eruptions of the Old Faithful geyser
recorded in datasets::faithful: for each eruption, its
duration and the waiting time to the next one. Two questions organise
the study, and each calls a different half of the toolkit.
The first question is regular and the tools answer it cleanly: how does the waiting time depend on the eruption duration? A linear-Gaussian regression has a smooth, single-peaked posterior, so it is the setting where Differential Evolution, the Cramer-Rao reference and the joint-posterior plots are at home.
The second question is irregular and teaches where the same tools stop: the eruption durations themselves split into two groups, short and long. A two-component mixture has a multimodal posterior, so it is the setting for parallel tempering and for predictive comparison, and the setting where the Cramer-Rao reference correctly refuses to report a number.
library(gpumetropolis)
d <- faithful
op <- graphics::par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
plot(d$eruptions, d$waiting, pch = 19, col = "grey50", cex = 0.6,
xlab = "eruption (min)", ylab = "waiting (min)",
main = "Waiting against duration")
hist(d$eruptions, breaks = 25, col = "grey85", border = "white",
main = "Eruption durations", xlab = "eruption (min)")The model is
waiting = a + b * eruptions + Normal(0, sigma). The three
parameters are the intercept a, the slope b
and the log standard deviation ls = log(sigma), the last
unbounded so the sampler proposes freely. The log-likelihood is the
Gaussian density up to the additive constant the sampler does not
need.
reg <- gpum_model(
loglik = ~ -0.5 * ((waiting - a - b * eruptions) / exp(ls))^2 - ls,
params = c("a", "b", "ls"),
data = c("waiting", "eruptions")
)
dat <- list(waiting = d$waiting, eruptions = d$eruptions)The slope and intercept of a regression are strongly correlated in the posterior, the geometry where a diagonal random walk mixes slowly and the Differential Evolution proposal aligns itself for free. We fit both and read the effective sample size to see the difference.
fit_de <- gpu_metropolis(reg, data = dat, n_iter = 8000, n_chains = 16,
method = "de", seed = 1, backend = "cpu")
fit_rwm <- gpu_metropolis(reg, data = dat, n_iter = 8000, n_chains = 16,
method = "rwm", seed = 1, backend = "cpu")
c(rwm = ess(fit_rwm$draws[, , "b"], 0), de = ess(fit_de$draws[, , "b"], 0))
#> rwm de
#> 456.957 5622.360gpum_diagnose() rolls the per-parameter R-hat, effective
sample size and Monte Carlo error into one table with a verdict, and
opens a panel per parameter. Read it as a pass when R-hat is below 1.01
and the effective sample size is above 400 for every parameter.
gpum_diagnose(fit_de)
#> <gpum_diagnose: Converged>
#> method de, gamma 0.972, backend cpu, chains 16, iterations 4000 (warmup 4000, adaptive)
#>
#> parameter mean sd q2.5 q50 q97.5 Rhat ESS MCSE
#> a 33.4355 1.14322 31.2319 33.4226 35.6792 1.0023 5527.0433 0.0153774
#> b 10.7378 0.31039 10.1306 10.7406 11.3388 1.0027 5622.3599 0.0041395
#> ls 1.7790 0.04273 1.6971 1.7783 1.8639 1.0021 5892.5651 0.0005566
#>
#> R-hat below 1.01 and ESS at or above 400 in every parameter.The two samplers target the same posterior, so their draws should be
distributionally the same. ks_equivalence() checks that
with a Kolmogorov-Smirnov test after thinning to the effective sample
size; a large p-value and equivalent = TRUE confirm the
agreement.
The correlation between intercept and slope is the point of a
regression posterior. gpum_pairs() shows the marginals with
their highest density intervals on the diagonal and the joint draws with
their credible-region contour off it; gpum_surface() shows
the same density of a chosen pair as level curves and as a 3D
surface.
This model is regular and Gaussian, so the posterior should match the
information bound. gpum_crlb() forms the observed Fisher
information by a finite-difference Hessian of the log-likelihood and
inverts it; the standard deviations it reports should equal the
frequentist standard errors of lm().
cr <- gpum_crlb(fit_de, data = dat)
cr
#> <gpum_crlb>
#> parameter posterior_sd cramer_rao_sd ratio
#> a 1.14322 1.15682 0.9882
#> b 0.31039 0.31528 0.9845
#> ls 0.04273 0.04311 0.9912
#>
#> posterior correlation -0.951, Cramer-Rao -0.951
#>
#> Frequentist asymptotic reference (inverse Fisher information), prior-free. A weak/flat prior is assumed; an informative prior makes the posterior tighter.
lm_fit <- lm(waiting ~ eruptions, data = d)
rbind(gpum_crlb = cr$crlb_sd[1:2],
lm_se = coef(summary(lm_fit))[, "Std. Error"])
#> (Intercept) eruptions
#> gpum_crlb 1.156818 0.3152827
#> lm_se 1.154874 0.3147534Does the waiting time depend on the duration at all? The decision is
read from the posterior of the slope b.
gpum_hypothesis() gives the posterior probability that
b exceeds zero; hdi() gives its credible
interval; and gpum_rope() asks whether b is
practically different from zero against a region of practical
equivalence, a band of slopes too small to matter.
gpum_hypothesis(fit_de, "b", lower = 0)
#> <gpum_hypothesis>
#> parameter b, interval (0, Inf)
#> P(in interval) = 1.0000
#> P(<= 0) = 0.0000
hdi(as.vector(fit_de$draws[, , "b"]), 0.95)
#> lower upper
#> 10.12772 11.33157
rope_b <- gpum_rope(fit_de, "b", rope = 0.5, null = 0)
rope_b
#> <gpum_rope>
#> parameter b
#> 95% HDI [10.13, 11.33]
#> ROPE [-0.5, 0.5]
#> in ROPE 0.0%
#> decision different: HDI outside the ROPEThe hypothesis above is a statement about the posterior; the Bayes factor is a statement about the evidence, comparing the model with a slope to the model without one. Both need a proper prior, so we attach weak normal priors and compute the marginal likelihood of each by thermodynamic integration.
reg_slope <- gpum_model(
loglik = ~ -0.5 * ((waiting - a - b * eruptions) / exp(ls))^2 - ls,
params = c("a", "b", "ls"), data = c("waiting", "eruptions"),
prior = ~ -0.5 * (a / 80)^2 - 0.5 * (b / 40)^2 - 0.5 * ((ls - 2.3) / 1)^2
)
reg_flat <- gpum_model(
loglik = ~ -0.5 * ((waiting - a) / exp(ls))^2 - ls,
params = c("a", "ls"), data = "waiting",
prior = ~ -0.5 * (a / 80)^2 - 0.5 * ((ls - 2.3) / 1)^2
)
bf <- gpum_bayes_factor(reg_slope, reg_flat, data = dat,
n_rungs = 12L, n_iter = 2500L, n_chains = 8L, seed = 1L)
bf
#> <gpum_bayes_factor>
#> B_10 = 2.524e+96 (log 221.974)
#> log evidence: model1 -633.043, model0 -855.017
#> evidence for model1 over model0: decisive
#> Note: Bayes factors are prior-sensitive (Jeffreys-Lindley); report the priors.The evidence for a slope is overwhelming, as expected from a scatter this clean; the Bayes factor and the posterior probability agree, by different routes.
How does gpumetropolis stand against the samplers
already in use, on this real regression? We fit
waiting ~ eruptions with each and read the effective sample
per second, computed uniformly with coda, and the recovered
slope as the agreement check. Each competitor is guarded, so a missing
one is skipped.
have <- function(p) requireNamespace(p, quietly = TRUE)
ess_ps <- function(dr, secs) as.numeric(coda::effectiveSize(dr)) / max(secs, 0.02)
N <- 30000L; yv <- d$waiting; xv <- d$eruptions; nobs <- length(yv)
rows <- list()
if (have("coda")) {
t <- system.time(fg <- gpu_metropolis(reg, data = dat, n_iter = N, n_chains = 8,
method = "de", seed = 1, backend = "cpu"))["elapsed"]
rows[["gpumetropolis"]] <- c(ess_ps(as.vector(fg$draws[, , "b"]), t), unname(t), mean(fg$draws[, , "b"]))
if (have("MCMCpack")) rows[["MCMCpack"]] <- tryCatch({
t <- system.time(f <- MCMCpack::MCMCregress(waiting ~ eruptions, data = d, mcmc = N, verbose = 0))["elapsed"]
c(ess_ps(f[, "eruptions"], t), unname(t), mean(f[, "eruptions"])) }, error = function(e) rep(NA, 3))
if (have("mcmc")) rows[["mcmc"]] <- tryCatch({
lp <- function(th) sum(dnorm(yv, th[1] + th[2] * xv, exp(th[3]), log = TRUE)) - 0.5 * (th[1] / 80)^2 - 0.5 * (th[2] / 50)^2
t <- system.time(o <- mcmc::metrop(lp, c(34, 11, 1.8), nbatch = N, scale = c(2, 0.6, 0.05)))["elapsed"]
c(ess_ps(o$batch[, 2], t), unname(t), mean(o$batch[, 2])) }, error = function(e) rep(NA, 3))
if (have("nimble")) rows[["nimble"]] <- tryCatch({
suppressMessages(library(nimble))
code <- nimbleCode({ for (i in 1:n) { mu[i] <- a + b * x[i]; y[i] ~ dnorm(mu[i], sd = s) }
a ~ dnorm(0, sd = 80); b ~ dnorm(0, sd = 50); s ~ dunif(0, 50) })
t <- system.time({
md <- nimbleModel(code, constants = list(n = nobs), data = list(y = yv, x = xv),
inits = list(a = 34, b = 11, s = 6)); cm <- compileNimble(md)
mc <- buildMCMC(configureMCMC(md, monitors = c("a", "b", "s")))
cmc <- compileNimble(mc, project = md); sm <- runMCMC(cmc, niter = N, nburnin = 0, progressBar = FALSE)
})["elapsed"]
c(ess_ps(sm[, "b"], t), unname(t), mean(sm[, "b"])) }, error = function(e) rep(NA, 3))
if (have("BayesianTools")) rows[["BayesianTools"]] <- tryCatch({
ll <- function(par) sum(dnorm(yv, par[1] + par[2] * xv, exp(par[3]), log = TRUE))
su <- BayesianTools::createBayesianSetup(ll, lower = c(0, 0, 0), upper = c(80, 25, 3))
t <- system.time(o <- BayesianTools::runMCMC(su, settings = list(iterations = N, message = FALSE)))["elapsed"]
s <- as.matrix(BayesianTools::getSample(o, coda = TRUE))
c(ess_ps(s[, 2], t), unname(t), mean(s[, 2])) }, error = function(e) rep(NA, 3))
if (have("cmdstanr")) rows[["Stan"]] <- tryCatch({
sc <- "data{int n; vector[n] y; vector[n] x;} parameters{real a; real b; real<lower=0> sigma;} model{ y ~ normal(a + b*x, sigma); a ~ normal(0,80); b ~ normal(0,50);}"
mod <- cmdstanr::cmdstan_model(cmdstanr::write_stan_file(sc))
t <- system.time(sf <- mod$sample(data = list(n = nobs, y = yv, x = xv), chains = 1,
iter_sampling = N, iter_warmup = 1000, refresh = 0, show_messages = FALSE))["elapsed"]
dr <- as.numeric(sf$draws("b", format = "draws_matrix")); c(ess_ps(dr, t), unname(t), mean(dr)) }, error = function(e) rep(NA, 3))
}
#> ===== Monitors =====
#> thin = 1: a, b, s
#> ===== Samplers =====
#> RW sampler (1)
#> - s
#> conjugate sampler (2)
#> - a
#> - b
comp <- as.data.frame(do.call(rbind, rows)); names(comp) <- c("ESS_per_sec", "wall_sec", "slope")
comp <- comp[!is.na(comp$ESS_per_sec), ]; comp <- comp[order(-comp$ESS_per_sec), ]
round(comp, 2)
#> ESS_per_sec wall_sec slope
#> MCMCpack 459336.37 0.06 10.73
#> gpumetropolis 3546.46 3.41 10.73
#> mcmc 1221.62 0.62 10.73
#> BayesianTools 648.95 2.49 10.73
#> nimble 73.17 20.64 10.73The verdict mirrors the boundary the package draws honestly: every
sampler recovers the same slope, so they agree; in this regime, a small
data set with one chain, the leader on effective sample per second is
MCMCpack, a specialised conjugate sampler against which a generic random
walk does not claim to win. The advantage of gpumetropolis
is the many-chains and large-data regime of the registered benchmark,
and the portability of one kernel across CPU and GPU. The consolidated
table of what each package delivers is in the case_mtcars
chapter.
The eruption durations are not one population but two, a short group
near two minutes and a long group near four and a quarter. A
two-component Gaussian mixture captures that, with a weight
w = 1 / (1 + exp(-eta)) on the first component and a mean
and log scale for each.
A mixture posterior has two equivalent labellings (the two components can swap with no change in likelihood), so it is genuinely multimodal. Chains started in different labellings stay in them, and the split R-hat is far above one.
init <- cbind(eta = rnorm(8, 0, 0.5),
mu1 = c(rep(2, 4), rep(4.3, 4)), # half the chains in each labelling
ls1 = rep(log(0.3), 8),
mu2 = c(rep(4.3, 4), rep(2, 4)),
ls2 = rep(log(0.4), 8))
fit_mix_rwm <- gpu_metropolis(mix, data = xdat, init = init,
proposal_sd = c(0.2, 0.1, 0.1, 0.1, 0.1),
n_iter = 6000, method = "rwm", seed = 3,
backend = "cpu")
max(vapply(c("mu1", "mu2"),
function(p) rhat(fit_mix_rwm$draws[, , p], 0), numeric(1)))
#> [1] 38.66987Because the posterior is multimodal, the Cramer-Rao reference does
not apply, and gpum_crlb() says so rather than report a
misleading number: its R-hat guard catches the non-convergence here.
Parallel tempering runs each chain at its own temperature and swaps
states between them, so the cold chain crosses between labellings
instead of sticking. That crossing is correct sampling, but it makes the
raw mean of mu1 meaningless: averaged over both labellings
it lands between the two true means. The fix is a canonical labelling,
here mu1 < mu2, applied draw by draw before the
components are read.
fit_mix <- gpu_metropolis(mix, data = xdat, init = init,
proposal_sd = c(0.2, 0.1, 0.1, 0.1, 0.1),
n_iter = 6000, method = "pt", seed = 2,
backend = "cpu")
cold <- fit_mix$draws[, 1L, , drop = TRUE] # cold chain, the posterior
swap <- cold[, "mu1"] > cold[, "mu2"] # enforce mu1 < mu2
lo <- ifelse(swap, cold[, "mu2"], cold[, "mu1"])
hi <- ifelse(swap, cold[, "mu1"], cold[, "mu2"])
round(c(short_component = mean(lo), long_component = mean(hi)), 2)
#> short_component long_component
#> 2.02 4.27The model choice, one Gaussian or two, is a predictive question, and
WAIC and PSIS-LOO answer it without a marginal likelihood. They need the
per-observation log-likelihood normalised, so the single-Gaussian and
the mixture both carry their 1 / sqrt(2 pi) factor here.
The lower value is the better predictor.
single <- gpum_model(
loglik = ~ -0.5 * ((x - mu) / exp(ls))^2 - ls - 0.9189385,
params = c("mu", "ls"), data = "x")
fit_single <- gpu_metropolis(single, data = xdat, n_iter = 6000,
n_chains = 8, method = "de", seed = 1,
backend = "cpu")
mix_norm <- gpum_model(
loglik = ~ log(
(1 / (1 + exp(-eta))) *
exp(-0.5 * ((x - mu1) / exp(ls1))^2) / exp(ls1) / 2.5066283
+ (1 - 1 / (1 + exp(-eta))) *
exp(-0.5 * ((x - mu2) / exp(ls2))^2) / exp(ls2) / 2.5066283),
params = c("eta", "mu1", "ls1", "mu2", "ls2"), data = "x")
fit_mix_norm <- gpu_metropolis(mix_norm, data = xdat, init = init,
proposal_sd = c(0.2, 0.1, 0.1, 0.1, 0.1),
n_iter = 6000, method = "pt", seed = 2,
backend = "cpu")
c(single = gpum_waic(fit_single, xdat)$waic,
mixture = gpum_waic(fit_mix_norm, xdat)$waic)
#> single mixture
#> 846.2264 564.2231
if (requireNamespace("loo", quietly = TRUE)) {
c(single = gpum_loo(fit_single, xdat)$estimates["looic", "Estimate"],
mixture = gpum_loo(fit_mix_norm, xdat)$estimates["looic", "Estimate"])
}
#> single mixture
#> 846.2286 564.2246The two-component model has the much lower WAIC and LOO, the quantitative counterpart of the obvious two peaks in the histogram.
An applied fit is only trustworthy when the distribution it generates
reproduces the data. gpum_ppc() simulates a posterior
predictive sample from a fit given the family’s simulator, and
gpum_density_compare() overlays the observed density with
the generated ones. We check both questions of the study. On the left,
the eruption durations: the two-component mixture reproduces the two
peaks, the single Gaussian cannot. On the right, the waiting times: the
regression’s predictive density tracks the observed one, two peaks
included, because the predictor is itself bimodal.
pp_mix <- gpum_ppc(fit_mix, function(p)
if (stats::runif(1) < 1 / (1 + exp(-p["eta"]))) stats::rnorm(1, p["mu1"], exp(p["ls1"]))
else stats::rnorm(1, p["mu2"], exp(p["ls2"])))
pp_single <- gpum_ppc(fit_single, function(p) stats::rnorm(1, p["mu"], exp(p["ls"])))
pp_wait <- gpum_ppc(fit_de, function(p)
stats::rnorm(1, p["a"] + p["b"] * sample(d$eruptions, 1), exp(p["ls"])))
op <- graphics::par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
gpum_density_compare(d$eruptions, list(mixture = pp_mix, single = pp_single),
main = "Eruption duration", xlab = "min")
gpum_density_compare(d$waiting, list(regression = pp_wait),
main = "Waiting time", xlab = "min")The regular half of the study used the sampler
method = "de" against the correlated regression geometry,
gpum_diagnose and ks_equivalence for
convergence and sampler agreement, gpum_pairs and
gpum_surface for the joint posterior,
gpum_crlb matched against lm for the
information bound, gpum_hypothesis, hdi and
gpum_rope for the decision on the slope, and
gpum_bayes_factor for the weight of evidence. The irregular
half used method = "pt" to sample the multimodal mixture,
gpum_crlb again to show the reference declining where it
does not apply, and gpum_waic with gpum_loo
for the predictive model choice. One real data set exercised the package
end to end.