--- title: "Getting started with gpumetropolis" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting started with gpumetropolis} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") # Cap the native CPU backend's Rayon thread pool at two threads while the # vignette is built: R CMD check farms allow at most two cores. Sys.setenv(RAYON_NUM_THREADS = "2") ``` `gpumetropolis` is a generic random-walk Metropolis-Hastings sampler. A model is declared by writing its log-likelihood as an ordinary R formula; the package compiles that formula to a portable kernel that runs on the CPU and on the GPU from one source. The sampler advances many independent chains in one batched pass. This vignette walks through the workflow on the CPU backend, which every build provides. The CUDA and Vulkan backends are optional and are noted where they apply. ```{r library} library(gpumetropolis) ``` ## Declaring a model A model is declared with `gpum_model()`. The argument `loglik` is a one-sided formula giving the per-observation log-likelihood, up to an additive constant, in the parameter and data names. The example below is the Gaussian mean with a known standard deviation of 2 and a flat prior on the mean. Its posterior is available in closed form, which makes it a clean check that the sampler recovers a known quantity. ```{r model} model <- gpum_model( loglik = ~ -((y - mu)^2) / 8, # sigma = 2, so 0.5 / sigma^2 = 1 / 8 params = "mu", data = "y" ) model ``` The operations allowed in a formula are `+`, `-`, `*`, `/`, `^`, unary `-`, and `exp`, `log`, `sqrt`. A symbol that is neither a declared parameter nor a data column, or a function outside this set, is rejected at compile time with a clear message. ## Running the sampler `gpu_metropolis()` runs the sampler. The data is passed as a named list, one entry per declared data column. `proposal_sd` is the standard deviation of the Gaussian random-walk proposal and should be tuned to the scale of the posterior; here the posterior standard deviation is about `2 / sqrt(n)`. ```{r run} set.seed(1) y <- rnorm(5000, mean = 3.4, sd = 2) fit <- gpu_metropolis( model, data = list(y = y), proposal_sd = 0.06, n_iter = 2000, n_chains = 8, seed = 1, backend = "cpu" ) fit ``` The printed summary reports the acceptance rate and the posterior mean and standard deviation of each parameter. The first half of the iterations is discarded as warmup before `fit$draws` is returned; under the default `adapt = TRUE`, that warmup is adaptive: the kernel runs in batches and the per-chain proposal scale moves toward the asymptotic optimum acceptance, so `proposal_sd` is only the seed of the warmup and not a knob to sintonise. Setting `warmup = 0` keeps every iteration for inspecting the burn-in trajectory; `adapt = FALSE` falls back to the trim-only warmup of 0.1.x. ## The fit object The draws are in `fit$draws`, an array of `n_iter - warmup` by `n_chains` by `n_params`, with the parameters named on the third dimension. The raw count, the warmup count and the kept count are recorded as `fit$n_iter_total`, `fit$warmup` and `fit$n_iter`. ```{r draws} dim(fit$draws) mu_draws <- fit$draws[, , "mu"] # iterations (post-warmup) by chains ``` The closed-form posterior of this model is available through `gaussian_mean_posterior()`, so the sampler can be checked against the exact answer. ```{r truth} truth <- gaussian_mean_posterior(y, sigma = 2) c(sampler_mean = mean(mu_draws), exact_mean = truth$mean) c(sampler_sd = sd(as.vector(mu_draws)), exact_sd = truth$sd) ``` ## Convergence diagnostics `gpum_diagnose(fit)` is the one-call diagnostic: a per-parameter table with mean, standard deviation, the 2.5%, 50% and 97.5% quantiles, the split R-hat, the effective sample size and the Monte Carlo standard error, plus a convergence verdict from the canonical thresholds (R-hat below 1.01 and ESS at or above 400 in every parameter; Vehtari et al. 2021). When `plot = TRUE`, opens a multi-panel plot per parameter (trace, pooled density, running mean, pooled ACF) and, when the warmup was adaptive, an extra row showing the per-chain acceptance rate by warmup batch with the asymptotic optimum as a reference. ```{r gpum-diagnose, fig.width = 8, fig.height = 6} gpum_diagnose(fit) ``` The individual building blocks are also exported. They accept the post-warmup `n_iter` by `n_chains` matrix of one parameter's draws; when the input is already post-warmup, pass `warmup = 0` to keep all rows for the diagnostic. ```{r diagnostics} rhat(mu_draws, warmup = 0) # split potential scale reduction factor, near 1 at convergence ess(mu_draws, warmup = 0) # effective sample size, summed over chains ``` R-hat near 1 indicates the chains have mixed; the effective sample size reports how many independent draws the autocorrelated output is worth. ## One source, several backends The same model declaration runs on the CPU, CUDA and Vulkan backends. Only the `backend` argument changes. The CUDA and Vulkan backends are optional Cargo features; the lines below run when the package was built with them. ```{r gpu, eval = FALSE} # the same declaration, dispatched to the GPU fit_gpu <- gpu_metropolis( model, data = list(y = y), proposal_sd = 0.06, n_iter = 2000, n_chains = 8, seed = 1, backend = "cuda" ) ``` Equivalence between two MCMC runs is distributional, never bit-exact, because the algorithm is stochastic. `ks_equivalence()` runs a two-sample Kolmogorov-Smirnov test, thinning the pooled draws to the effective sample size first, since that test assumes independent draws. The check below compares two CPU runs from different seeds; the same call compares a CPU run against a GPU run. ```{r equivalence} fit_b <- gpu_metropolis( model, data = list(y = y), proposal_sd = 0.06, n_iter = 2000, n_chains = 8, seed = 2, backend = "cpu" ) ks_equivalence(fit$draws[, , "mu"], fit_b$draws[, , "mu"]) ``` ## A model with more than one parameter The DSL is not restricted to one parameter or to the Gaussian mean. The model below has an unknown mean `mu` and an unknown log standard deviation `ls`; `proposal_sd` then takes one value per parameter. ```{r two-params} set.seed(2) y2 <- rnorm(3000, mean = 2, sd = 1.5) model2 <- gpum_model( loglik = ~ -((y - mu)^2) / (2 * exp(2 * ls)) - ls, params = c("mu", "ls"), data = "y" ) init <- cbind(seq(-1, 5, length.out = 6), seq(-1, 1, length.out = 6)) fit2 <- gpu_metropolis( model2, data = list(y = y2), init = init, proposal_sd = c(0.03, 0.02), n_iter = 4000, seed = 3, backend = "cpu" ) c(mu = mean(fit2$draws[, , "mu"]), ls = mean(fit2$draws[, , "ls"]), log_sd_true = log(1.5)) ``` ## When the GPU helps A GPU does not accelerate every MCMC run. The sequential dependence inside one chain cannot be parallelised. The parallelism comes from two axes: many independent chains, and the data-parallel evaluation of the log-density. A GPU pays off when the log-density is expensive to evaluate, that is over a large data set, or when thousands of chains are run. For a small model with few chains the transfer overhead dominates and the CPU is faster. The `"auto"` backend encodes this rule: it selects the CPU for few chains and a GPU backend for many. The package benchmark in the README characterises the regime in a refutable way.