This chapter studies a real
regression and the question that follows every regression: is the extra
predictor worth keeping? The data are the 32 cars of
datasets::mtcars; the response is fuel economy
mpg, and the candidate predictors are weight
wt and gross horsepower hp. The study fits the
simple model, reads it with the full inference toolkit, then compares it
against the richer model by both predictive accuracy and weight of
evidence.
library(gpumetropolis)
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "grey45",
xlab = "weight (1000 lb)", ylab = "fuel economy (mpg)",
main = "Fuel economy against weight")The model is mpg = a + b * wt + Normal(0, sigma), with
the log standard deviation ls unbounded. The slope and
intercept are strongly correlated in the posterior, which is the
geometry the Differential Evolution sampler handles without tuning.
m_wt <- gpum_model(
loglik = ~ -0.5 * ((mpg - a - b * wt) / exp(ls))^2 - ls,
params = c("a", "b", "ls"), data = c("mpg", "wt"))
dat <- list(mpg = mtcars$mpg, wt = mtcars$wt, hp = mtcars$hp)
fit_wt <- gpu_metropolis(m_wt, data = dat, n_iter = 8000, n_chains = 16,
method = "de", seed = 1, backend = "cpu")
gpum_diagnose(fit_wt, plot = FALSE)
#> <gpum_diagnose: Converged>
#> method de, gamma 0.972, backend cpu, chains 16, iterations 4000 (warmup 4000, adaptive)
#>
#> parameter mean sd q2.5 q50 q97.5 Rhat ESS MCSE
#> a 37.2797 1.9113 33.4652 37.3076 41.0464 1.0021 5779.9860 0.025140
#> b -5.3444 0.5681 -6.4626 -5.3503 -4.2258 1.0025 5909.2347 0.007390
#> ls 1.1286 0.1281 0.8914 1.1237 1.3953 1.0025 5540.6117 0.001721
#>
#> R-hat below 1.01 and ESS at or above 400 in every parameter.The joint posterior of intercept and slope is correlated;
gpum_pairs() shows it with the credible-region contour, and
gpum_crlb() checks the spread against the Cramer-Rao bound,
which for this regular Gaussian model closely matches the standard
errors of lm().
lm_wt <- lm(mpg ~ wt, data = mtcars)
rbind(gpum_crlb = cr$crlb_sd[1:2],
lm_se = coef(summary(lm_wt))[, "Std. Error"])
#> (Intercept) wt
#> gpum_crlb 1.905595 0.5674287
#> lm_se 1.877627 0.5591010The slope answers the scientific question.
gpum_hypothesis() gives the posterior probability that
heavier cars are less economical (b < 0),
hdi() its interval, and gpum_rope() whether
the effect is practically nonzero against a region of slopes too small
to matter, here a tenth of an mpg per 1000 lb.
gpum_hypothesis(fit_wt, "b", upper = 0) # P(b < 0): heavier means thirstier
#> <gpum_hypothesis>
#> parameter b, interval (-Inf, 0)
#> P(in interval) = 1.0000
#> P(>= 0) = 0.0000
hdi(as.vector(fit_wt$draws[, , "b"]), 0.95)
#> lower upper
#> -6.451410 -4.220682
rope_b <- gpum_rope(fit_wt, "b", rope = 0.1, null = 0)
rope_b
#> <gpum_rope>
#> parameter b
#> 95% HDI [-6.451, -4.221]
#> ROPE [-0.1, 0.1]
#> in ROPE 0.0%
#> decision different: HDI outside the ROPEThe richer model adds horsepower:
mpg = a + b1 * wt + b2 * hp + Normal(0, sigma). WAIC and
PSIS-LOO compare the two by out-of-sample predictive accuracy, with no
marginal likelihood. The log-likelihoods carry the same normalising
constant, so the comparison is valid; the lower value predicts
better.
m_wthp <- gpum_model(
loglik = ~ -0.5 * ((mpg - a - b1 * wt - b2 * hp) / exp(ls))^2 - ls - 0.9189385,
params = c("a", "b1", "b2", "ls"), data = c("mpg", "wt", "hp"))
m_wt_n <- gpum_model(
loglik = ~ -0.5 * ((mpg - a - b * wt) / exp(ls))^2 - ls - 0.9189385,
params = c("a", "b", "ls"), data = c("mpg", "wt"))
fit_wt_n <- gpu_metropolis(m_wt_n, data = dat, n_iter = 8000, n_chains = 16,
method = "de", seed = 1, backend = "cpu")
fit_wthp <- gpu_metropolis(m_wthp, data = dat, n_iter = 8000, n_chains = 16,
method = "de", seed = 1, backend = "cpu")
c(wt = gpum_waic(fit_wt_n, dat)$waic, wt_hp = gpum_waic(fit_wthp, dat)$waic)
#> wt wt_hp
#> 166.5126 157.2182
if (requireNamespace("loo", quietly = TRUE)) {
c(wt = gpum_loo(fit_wt_n, dat)$estimates["looic", "Estimate"],
wt_hp = gpum_loo(fit_wthp, dat)$estimates["looic", "Estimate"])
}
#> wt wt_hp
#> 166.6655 157.6015The Bayes factor weighs the evidence directly, comparing the two-predictor model against the one-predictor model. Both carry weak proper priors, and the marginal likelihood of each is estimated by thermodynamic integration.
m_wthp_p <- gpum_model(
loglik = ~ -0.5 * ((mpg - a - b1 * wt - b2 * hp) / exp(ls))^2 - ls,
params = c("a", "b1", "b2", "ls"), data = c("mpg", "wt", "hp"),
prior = ~ -0.5 * (a / 40)^2 - 0.5 * (b1 / 20)^2 - 0.5 * (b2 / 1)^2
- 0.5 * ((ls - 1) / 1)^2)
m_wt_p <- gpum_model(
loglik = ~ -0.5 * ((mpg - a - b * wt) / exp(ls))^2 - ls,
params = c("a", "b", "ls"), data = c("mpg", "wt"),
prior = ~ -0.5 * (a / 40)^2 - 0.5 * (b / 20)^2 - 0.5 * ((ls - 1) / 1)^2)
bf <- gpum_bayes_factor(m_wthp_p, m_wt_p, data = dat,
n_rungs = 12L, n_iter = 2500L, n_chains = 8L, seed = 1L)
bf
#> <gpum_bayes_factor>
#> B_10 = 4.272 (log 1.452)
#> log evidence: model1 -59.796, model0 -61.248
#> evidence for model1 over model0: substantial
#> Note: Bayes factors are prior-sensitive (Jeffreys-Lindley); report the priors.The two criteria can disagree in emphasis: WAIC and LOO reward predictive accuracy, while the Bayes factor, sensitive to the prior, penalises the added parameter unless the data demand it. Reading them together is the honest summary: horsepower carries some predictive signal beyond weight, and how decisively depends on whether the goal is prediction or a weight of evidence under a stated prior.
The fitted models should generate fuel economies that look like the observed ones. The posterior-predictive densities of both models, overlaid on the observed density, confirm that both reproduce the data well, the two predictive curves nearly coinciding. That near-coincidence is the point: when two models generate almost the same distribution, the eye cannot rank them, and the predictive criteria of the previous section, WAIC and LOO, are what separate them.
gpum_ppc() simulates from each fitted model, drawing a
random car for each predictive draw so the covariates match the data,
and gpum_density_compare() overlays the observed and
generated densities.
pp_wt <- gpum_ppc(fit_wt_n, function(p) {
r <- sample(nrow(mtcars), 1)
stats::rnorm(1, p["a"] + p["b"] * mtcars$wt[r], exp(p["ls"]))
})
pp_wthp <- gpum_ppc(fit_wthp, function(p) {
r <- sample(nrow(mtcars), 1)
stats::rnorm(1, p["a"] + p["b1"] * mtcars$wt[r] + p["b2"] * mtcars$hp[r], exp(p["ls"]))
})
gpum_density_compare(mtcars$mpg, list(wt = pp_wt, `wt + hp` = pp_wthp),
main = "Fuel economy: observed against generated",
xlab = "mpg")The honest question an applied user asks is how this package stands
against the ones already in use. We fit the same simple model,
mpg ~ wt, with each of the established Bayesian samplers
and with gpumetropolis, and read off two things: whether
they agree on the answer, and how much effective sample each delivers
per second. The effective sample size is computed uniformly with
coda so the estimator is not a confounder; the slope
recovered by each is the agreement check.
have <- function(p) requireNamespace(p, quietly = TRUE)
# Effective sample per second; max(secs, ...) keeps the ratio stable when a
# very fast sampler runs near the timer resolution.
ess_ps <- function(draws, secs) as.numeric(coda::effectiveSize(draws)) / max(secs, 0.02)
N <- 30000L; yv <- mtcars$mpg; xw <- mtcars$wt; nobs <- length(yv)
rows <- list()
if (have("coda")) {
t <- system.time(fg <- gpu_metropolis(m_wt, data = list(mpg = yv, wt = xw),
n_iter = N, n_chains = 8, method = "de", seed = 1, backend = "cpu"))["elapsed"]
rows[["gpumetropolis"]] <- c(ess_per_sec = ess_ps(as.vector(fg$draws[, , "b"]), t),
wall_sec = unname(t), slope = mean(fg$draws[, , "b"]))
if (have("MCMCpack")) rows[["MCMCpack"]] <- tryCatch({
t <- system.time(f <- MCMCpack::MCMCregress(mpg ~ wt, data = mtcars, mcmc = N, verbose = 0))["elapsed"]
c(ess_ps(f[, "wt"], t), unname(t), mean(f[, "wt"])) }, error = function(e) rep(NA, 3))
if (have("mcmc")) rows[["mcmc"]] <- tryCatch({
lp <- function(th) sum(dnorm(yv, th[1] + th[2] * xw, exp(th[3]), log = TRUE)) -
0.5 * (th[1] / 50)^2 - 0.5 * (th[2] / 50)^2
t <- system.time(o <- mcmc::metrop(lp, c(20, -5, 1), nbatch = N, scale = c(2, 0.6, 0.1)))["elapsed"]
c(ess_ps(o$batch[, 2], t), unname(t), mean(o$batch[, 2])) }, error = function(e) rep(NA, 3))
if (have("nimble")) rows[["nimble"]] <- tryCatch({
suppressMessages(library(nimble))
code <- nimbleCode({ for (i in 1:n) { mu[i] <- a + b * x[i]; y[i] ~ dnorm(mu[i], sd = s) }
a ~ dnorm(0, sd = 50); b ~ dnorm(0, sd = 50); s ~ dunif(0, 50) })
t <- system.time({
md <- nimbleModel(code, constants = list(n = nobs),
data = list(y = yv, x = xw), inits = list(a = 20, b = -5, s = 3))
cm <- compileNimble(md)
mc <- buildMCMC(configureMCMC(md, monitors = c("a", "b", "s")))
cmc <- compileNimble(mc, project = md)
sm <- runMCMC(cmc, niter = N, nburnin = 0, progressBar = FALSE)
})["elapsed"]
c(ess_ps(sm[, "b"], t), unname(t), mean(sm[, "b"])) }, error = function(e) rep(NA, 3))
if (have("BayesianTools")) rows[["BayesianTools"]] <- tryCatch({
ll <- function(par) sum(dnorm(yv, par[1] + par[2] * xw, exp(par[3]), log = TRUE))
su <- BayesianTools::createBayesianSetup(ll, lower = c(0, -15, -2), upper = c(40, 5, 3))
t <- system.time(o <- BayesianTools::runMCMC(su, settings = list(iterations = N, message = FALSE)))["elapsed"]
s <- as.matrix(BayesianTools::getSample(o, coda = TRUE))
c(ess_ps(s[, 2], t), unname(t), mean(s[, 2])) }, error = function(e) rep(NA, 3))
if (have("cmdstanr")) rows[["Stan"]] <- tryCatch({
sc <- "data{int n; vector[n] y; vector[n] x;} parameters{real a; real b; real<lower=0> sigma;} model{ y ~ normal(a + b*x, sigma); a ~ normal(0,50); b ~ normal(0,50);}"
mod <- cmdstanr::cmdstan_model(cmdstanr::write_stan_file(sc))
t <- system.time(sf <- mod$sample(data = list(n = nobs, y = yv, x = xw), chains = 1,
iter_sampling = N, iter_warmup = 1000, refresh = 0, show_messages = FALSE))["elapsed"]
dr <- as.numeric(sf$draws("b", format = "draws_matrix"))
c(ess_ps(dr, t), unname(t), mean(dr)) }, error = function(e) rep(NA, 3))
}
#> ===== Monitors =====
#> thin = 1: a, b, s
#> ===== Samplers =====
#> RW sampler (1)
#> - s
#> conjugate sampler (2)
#> - a
#> - b
comp <- as.data.frame(do.call(rbind, rows))
names(comp) <- c("ESS_per_sec", "wall_sec", "slope")
comp <- comp[!is.na(comp$ESS_per_sec), ]
comp <- comp[order(-comp$ESS_per_sec), ]
round(comp, 2)
#> ESS_per_sec wall_sec slope
#> MCMCpack 750000.00 0.04 -5.34
#> gpumetropolis 5953.83 2.10 -5.34
#> mcmc 2666.69 0.25 -5.32
#> BayesianTools 366.09 2.45 -5.24
#> nimble 55.45 23.61 -5.33Two readings. First, every sampler recovers the same slope near the
least-squares value, so they agree on the inference; the comparison is
about cost, not correctness. Second, in this regime, a small data set
fit with one chain, the leader on effective sample per second is
MCMCpack, a specialised or gradient sampler, with
gpumetropolis competitive but not ahead. That is the honest
boundary the package draws: it does not claim to beat a tuned
single-chain sampler on a small regular model. Its advantage is the
many-chains and large-data regime, where the registered benchmark of
EXPERIMENT_PROTOCOL.md records it reaching tens to hundreds
of times the effective sample per second of the same competitors, and
the only one to scale to thousands of chains. Time to first sample is a
second axis the per-second number hides: nimble compiles a
tailored sampler for each model, a cost that dominates its wall-clock on
a small fit and can keep it out of the live table of a non-interactive
build, while its registered benchmark numbers, where that cost is
amortised over long runs, are in
EXPERIMENT_PROTOCOL.md.
What each package delivers, beyond this one number:
| Package | What it delivers |
|---|---|
gpumetropolis |
one formula running on CPU, CUDA and Vulkan; scales to thousands of chains; strongest in the many-chains, expensive-density regime |
| MCMCpack | fast canned Gibbs samplers for standard models like regression; little to tune; limited to its catalogue |
| mcmc | a general random-walk metrop() over any log-density;
light and simple |
| nimble | a full BUGS-style modelling language with many algorithms; compiles each model, so high time-to-first-sample |
| BayesianTools | flexible likelihood-based samplers with strong diagnostics; aimed at expensive models |
| Stan | gradient samplers (HMC/NUTS) that dominate differentiable models in high dimension, after a model compile |
A real regression carried the regular toolkit,
method = "de", gpum_diagnose,
gpum_pairs, gpum_crlb matched to
lm, and the decision verbs gpum_hypothesis,
hdi and gpum_rope on the slope, and then put
the model-comparison verbs to work, gpum_waic and
gpum_loo for predictive accuracy and
gpum_bayes_factor for the weight of evidence, on the
concrete question of whether to keep a second predictor.