Bimodal extreme-value data: Gumbel mixture with parallel tempering

This vignette is the second chapter of the package’s worked-case companion. The first chapter, m2_parallel_tempering, used a synthetic separated bimodal target to show why method = "pt" is needed. This chapter takes the same machinery to a real-world question: when annual extremes come from two distinct weather regimes, can the package recover both regimes and decide whether a single Gumbel is enough?

The setting is the textbook one for univariate block maxima: each year contributes its maximum, and the resulting series is modelled as a Gumbel or a Generalised Extreme Value distribution (Coles 2001). When the site sits at the edge of two climatic regimes the same series carries two families of extremes, say summer-driven and winter-driven maxima, and a single Gumbel is then the wrong model.

library(gpumetropolis)
suppressMessages({
  library(evd)
  library(diptest)
})

A real dataset: Oxford annual maximum temperatures

evd::oxford records the annual maximum daily temperature in degrees Fahrenheit at Oxford, England, from 1901 to 1980 (n = 80). It is a standard didactic dataset in evd (Stephenson 2002).

data(oxford)
y_ox <- as.numeric(oxford)
summary(y_ox)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   75.00   83.00   85.00   85.33   88.25   95.00
diptest::dip.test(y_ox)
#> 
#>  Hartigans' dip test for unimodality / multimodality
#> 
#> data:  y_ox
#> D = 0.05625, p-value = 0.0531
#> alternative hypothesis: non-unimodal, i.e., at least bimodal

The dip test sits right at the conventional boundary, p near 0.05. This is the “tie zone” described in CATALOG_DESIGN.md: the test does not clearly reject unimodality and does not comfortably accept it. The catalogue policy is then to fit both a unimodal candidate and a mixture challenger and compare by WAIC.

A single Gumbel as the Tier A candidate

The Gumbel log-density up to an additive constant is -(y - mu)/beta - exp(-(y - mu)/beta) - log(beta); with the scale reparametrised as lb = log(beta) the parameter is unbounded:

model_gumbel <- gpum_model(
  loglik = ~ -(y - mu) / exp(lb)
             - exp(-(y - mu) / exp(lb))
             - lb,
  params = c("mu", "lb"), data = "y"
)
fit_single <- gpu_metropolis(
  model_gumbel, data = list(y = y_ox),
  proposal_sd = c(0.5, 0.1),
  init = matrix(c(median(y_ox), log(IQR(y_ox) / 2)),
                nrow = 4, ncol = 2, byrow = TRUE),
  n_iter = 4000, n_chains = 4, seed = 1, backend = "cpu"
)
fit_single
#> <gpum_fit>
#>   parameters  : mu, lb
#>   method      : rwm
#>   backend     : cpu
#>   chains      : 4
#>   iterations  : 2000 per chain (4000 raw, 2000 adaptive warmup discarded)
#>   accept_rate : 0.242 to 0.266
#>   mu          : posterior mean 83.1812 (sd 0.5041)
#>   lb          : posterior mean 1.4341 (sd 0.0793)

The posterior summary recovers mu near 83 (degrees Fahrenheit) and beta = exp(lb) near 4.2, the textbook fit. gpum_diagnose(fit_single) opens the convergence panel.

gpum_diagnose(fit_single)
#> <gpum_diagnose: Converged>
#>   method rwm, backend cpu, chains 4, iterations 2000 (warmup 2000, adaptive)
#> 
#>  parameter    mean      sd    q2.5     q50   q97.5   Rhat      ESS    MCSE
#>         mu 83.1812 0.50414 82.2337 83.1708 84.1708 1.0037 717.1473 0.01883
#>         lb  1.4341 0.07926  1.2818  1.4319  1.5945 1.0010 966.3131 0.00255
#> 
#>   R-hat below 1.01 and ESS at or above 400 in every parameter.

Why a Gumbel mixture on Oxford is harder than it looks

When the modes overlap heavily, as they do in Oxford, the mixture posterior on (p, mu_1, beta_1, mu_2, beta_2) is weakly identified: the data can be reasonably explained by a single Gumbel, so the posterior has a wide ridge where one component dominates and the other becomes a “ghost” whose parameters drift. Fitting a mixture on Oxford without informative priors or hard order constraints produces a fit whose nominal cold-chain summary is unstable. The catalogue layer of v0.6.0 will handle this through default priors and an order constraint; the next section shows the mixture recovery in a regime where the data carry it cleanly.

A two-regime simulation, the recovery story

A small simulation captures the structure of a station near a climatic edge. Each year is in either a cool-extreme regime or a warm-extreme regime with respective Gumbel parameters; the annual maximum is drawn from the corresponding distribution.

set.seed(2026)
rgumbel <- function(n, mu, beta) mu - beta * log(-log(runif(n)))
n_years <- 60
regime  <- rbinom(n_years, size = 1, prob = 0.65)
y <- ifelse(regime == 1,
            rgumbel(n_years, mu = 36, beta = 2.5),
            rgumbel(n_years, mu = 22, beta = 3.0))
hist(y, breaks = 25, freq = FALSE,
     main = "Simulated annual maxima at a climatic-edge station",
     xlab = "annual maximum (deg)", col = "grey85", border = "white")

The dip test does not reject unimodality on this sample, even though by construction the data come from a mixture:

diptest::dip.test(y)$p.value
#> [1] 0.4702142

That is the case the catalogue worries about: the dip test gives the “unimodal” verdict, yet the data come from a mixture. The decision between a single Gumbel and a mixture is settled below by WAIC, the predictive criterion the package computes from the pointwise log-likelihood, once both models are fitted to this sample.

The mixture model in the DSL

The 2-component Gumbel mixture log-density is log(p * f_1(y) + (1 - p) * f_2(y)), the canonical log-sum-exp pattern that the DSL handles through log, exp, +. The reparametrisation maps every parameter to the real line so random-walk Metropolis can propose freely:

eta_p in R, with p = 1 / (1 + exp(-eta_p));
lb1, lb2 in R, with beta_j = exp(lb_j);
mu1, mu2 in R, unconstrained.

model_mix <- gpum_model(
  loglik = ~ log(
      (1 / (1 + exp(-eta_p))) *
        exp(-((y - mu1) / exp(lb1)) - exp(-((y - mu1) / exp(lb1)))) / exp(lb1)
    + (1 - 1 / (1 + exp(-eta_p))) *
        exp(-((y - mu2) / exp(lb2)) - exp(-((y - mu2) / exp(lb2)))) / exp(lb2)
  ),
  params = c("eta_p", "mu1", "lb1", "mu2", "lb2"),
  data   = "y"
)

Parallel tempering on the mixture posterior

The posterior of a mixture has at least two equivalent labellings (component 1 and component 2 can switch with no change in likelihood), so the posterior is genuinely multimodal in the parameter space even when the data marginal is unimodal. Parallel tempering is the right sampler. The cold chain visits each labelling and the swap step keeps the chain from sticking to one of them; a post-processing step then fixes a canonical labelling.

qs <- quantile(y, c(0.2, 0.8))
init <- cbind(eta_p = rep(0, 8),
              mu1   = rep(qs[1], 8) + rnorm(8, 0, 1),
              lb1   = rep(log(2), 8),
              mu2   = rep(qs[2], 8) + rnorm(8, 0, 1),
              lb2   = rep(log(2), 8))
fit_mix <- gpu_metropolis(
  model_mix, data = list(y = y),
  proposal_sd = c(0.3, 0.4, 0.1, 0.4, 0.1),
  init = init, n_iter = 8000, n_chains = 8,
  method = "pt", seed = 11, backend = "cpu"
)
fit_mix
#> <gpum_fit>
#>   parameters  : eta_p, mu1, lb1, mu2, lb2
#>   method      : pt
#>   backend     : cpu
#>   chains      : 8
#>   iterations  : 4000 per chain (8000 raw, 4000 adaptive warmup discarded)
#>   accept_rate : 0.039 to 0.430
#>   swap accept : pairs (1-7) mean 0.410 to 0.772
#>   eta_p (T=1) : posterior mean -1.2134 (sd 0.3312)
#>   mu1 (T=1)   : posterior mean 22.4242 (sd 1.2085)
#>   lb1 (T=1)   : posterior mean 1.1800 (sd 0.3030)
#>   mu2 (T=1)   : posterior mean 36.0542 (sd 0.3745)
#>   lb2 (T=1)   : posterior mean 0.8072 (sd 0.1330)

Settling the choice by WAIC

The promise made above, that a mixture predicts this sample better than a single Gumbel, is now computed rather than asserted. The single Gumbel is fitted to the same simulated y, and gpum_waic() reads the predictive accuracy of each from its pointwise log-likelihood. Both log-likelihoods are normalised (the single Gumbel carries its - lb term, the mixture its / exp(lb) factors), so the comparison is valid; lower WAIC is the better predictor.

fit_single_y <- gpu_metropolis(
  model_gumbel, data = list(y = y),
  proposal_sd = c(0.5, 0.1),
  init = matrix(c(median(y), log(IQR(y) / 2)), nrow = 4, ncol = 2,
                byrow = TRUE),
  n_iter = 4000, n_chains = 4, seed = 1, backend = "cpu"
)
waic_single <- gpum_waic(fit_single_y, data = list(y = y))
waic_mix    <- gpum_waic(fit_mix, data = list(y = y))
c(single = waic_single$waic, mixture = waic_mix$waic)
#>   single  mixture 
#> 418.0165 361.6134

The mixture has the lower WAIC, the verdict the dip test could not give. The WAIC is invariant to the label switching of the next section, so it is read here directly from the cold-chain draws. gpum_loo() reports the same ranking through Pareto-smoothed leave-one-out cross-validation, with the Pareto k diagnostic flagging any observation whose estimate is unreliable.

if (requireNamespace("loo", quietly = TRUE)) gpum_loo(fit_mix, data = list(y = y))
#> 
#> Computed from 4000 by 60 log-likelihood matrix.
#> 
#>          Estimate   SE
#> elpd_loo   -180.9  7.3
#> p_loo         4.3  0.7
#> looic       361.8 14.6
#> ------
#> MCSE of elpd_loo is 0.0.
#> MCSE and ESS estimates assume independent draws (r_eff=1).
#> 
#> All Pareto k estimates are good (k < 0.7).
#> See help('pareto-k-diagnostic') for details.

Identifiability post-processing

The mixture log-density is invariant under swapping component 1 and component 2 (and simultaneously flipping eta_p to -eta_p). The canonical labelling enforced here is mu1 < mu2: every cold-chain draw in which mu1 > mu2 has its component labels swapped.

draws <- fit_mix$draws[, 1L, , drop = TRUE]
swap_idx <- draws[, "mu1"] > draws[, "mu2"]
if (any(swap_idx)) {
  draws[swap_idx, ] <- draws[swap_idx, c("eta_p", "mu2", "lb2", "mu1", "lb1")]
  draws[swap_idx, "eta_p"] <- -draws[swap_idx, "eta_p"]
}
p     <- 1 / (1 + exp(-draws[, "eta_p"]))
mu_cool  <- draws[, "mu1"];  beta_cool <- exp(draws[, "lb1"])
mu_warm  <- draws[, "mu2"];  beta_warm <- exp(draws[, "lb2"])

post <- data.frame(
  parameter = c("p (cool)", "mu_cool", "beta_cool", "mu_warm", "beta_warm"),
  mean      = c(mean(p), mean(mu_cool), mean(beta_cool),
                mean(mu_warm), mean(beta_warm)),
  sd        = c(sd(p), sd(mu_cool), sd(beta_cool),
                sd(mu_warm), sd(beta_warm)),
  truth     = c(0.35, 22, 3.0, 36, 2.5)
)
print(post, row.names = FALSE)
#>  parameter       mean         sd truth
#>   p (cool)  0.2341972 0.05766285  0.35
#>    mu_cool 22.4241515 1.20853642 22.00
#>  beta_cool  3.4168659 1.16149372  3.00
#>    mu_warm 36.0542177 0.37449097 36.00
#>  beta_warm  2.2615647 0.30441433  2.50

Posterior means recover the four Gumbel parameters within a posterior standard deviation; the mixing weight is slightly underestimated, the usual signal that for n = 60 the data identify the modes but not their exact balance.

Posterior predictive check

A direct visual check: generate new annual maxima from the fitted mixture and compare their distribution against the observed data. gpum_ppc() draws the posterior predictive sample, with the family’s simulator passed in one line (pick a component by its weight, then a Gumbel draw), and gpum_density_compare() overlays the densities. Label switching does not matter here, since the predictive distribution is the same in either labelling.

pp <- gpum_ppc(fit_mix, function(q) {
  if (stats::runif(1) < 1 / (1 + exp(-q["eta_p"]))) {
    q["mu1"] - exp(q["lb1"]) * log(-log(stats::runif(1)))
  } else {
    q["mu2"] - exp(q["lb2"]) * log(-log(stats::runif(1)))
  }
})
gpum_density_compare(y, list(mixture = pp),
                     main = "Posterior predictive vs observed",
                     xlab = "annual maximum (deg)")

The two densities overlap; the slight differences at the two peaks are within the sampling variation of n = 60 annual maxima.

Fair comparison with current practice

A fair benchmark of gpumetropolis against existing R practice on extreme-value mixtures asks two distinct questions: how does the single Gumbel fit compare to the established evd::fgev route, and how does the 2-component Gumbel mixture fit compare to what users would otherwise reach for. The two answers are quite different and the package’s value proposition is in the second one.

Single Gumbel: matched by an established MLE pipeline

evd::fgev is the standard one-liner for fitting a Gumbel (or full GEV) by maximum likelihood. On the Oxford series:

fit_evd <- evd::fgev(y_ox, shape = 0)  # shape = 0 -> Gumbel
fit_evd$estimate
#>       loc     scale 
#> 83.199562  4.157983
sqrt(diag(fit_evd$var.cov))   # standard errors
#> [1] 0.4932118 0.3370603

The MLE point estimates of loc (mu) and scale (beta) agree with the Bayesian posterior means of fit_single above to within the posterior standard deviation. The MLE pipeline is faster (a single optimisation) and is the right choice when the goal is just the point estimate plus delta-method standard errors. The Bayesian path through gpumetropolis adds: a full posterior, propagation of uncertainty into predictive quantities (return levels, quantile credible bands), and the portability of one kernel source across CPU, CUDA and Vulkan when the log-likelihood is expensive over a large series.

Honest verdict on the single-component case: the established evd pipeline is competitive in accuracy and faster in wall-clock time; gpumetropolis is the right choice when the user already wants a full Bayesian workflow or wants the same code path for downstream mixture and copula inference. Neither dominates the other on the Tier A unimodal Gumbel taken alone.

Two-component Gumbel mixture: no standard one-liner in R

There is no R package today that fits a k-component Gumbel mixture through a single function call. The closest options are:

Path	What it covers	Effort
`evd`, `extRemes`, `ismev`, `eva`	single Gumbel / GEV / GP	low, mature
`evmix`	bulk-and-tail hybrid models (Normal bulk plus GP tail)	low for the supported families, not a `k`-component Gumbel mixture
`mclust`, `mixR`, `flexmix`, `mixtools`	finite mixtures, almost exclusively of Gaussians	low for Gaussian mixtures, requires custom distribution code for Gumbel
`evdbayes`	Bayesian inference for the single GEV via Metropolis	low for single GEV, requires manual extension for mixtures
Custom `rstan` / `cmdstanr`	any user-coded mixture, including Gumbel	medium to high; needs Stan code, posterior label-switching handling and warm-up tuning
Custom `nimble`	same as Stan with BUGS-style syntax	medium; nimble does not ship `dgumbel`, the user writes a custom distribution

The honest comparison for the mixture case is therefore against custom-written Stan or nimble code. gpumetropolis offers three things in that comparison:

One-formula declaration. The model declaration above is the same R formula a user would already write for the log-likelihood; the package compiles it to bytecode and runs it on the chosen backend. The user does not write Stan or BUGS.
Built-in parallel tempering. The label-switching multimodality of a mixture posterior is the textbook case for tempering; in gpumetropolis it is method = "pt". In a Stan or nimble workflow the user typically reaches for relabelling tricks, ordered constraints or hand-coded swap moves.
Backend portability. The same declaration runs on the CPU, CUDA and Vulkan. For an extreme-value series of length n = 80 the GPU is not faster than the CPU; the portability matters when the workflow scales to longer series or to many parallel chains.

A fair speed comparison is harder to state in one number because the mixture pipeline is not standardised. The closest informal benchmark: running the simulation fit above with nimble requires (i) a custom dgumbel distribution, (ii) a careful order constraint on the means, (iii) a longer warm-up to manage label-switching without tempering. The user who is willing to write that code gets results comparable to the gpumetropolis mixture in wall-clock; the user who wants a one-formula path benefits from this package.

The catalogue layer of v0.6.0 (next section) is the design that closes the remaining gap: gpum_fit_catalog(y, catalog = c("gumbel", "gumbel_mix2")) will rank the two candidates by WAIC and pick the better one, the practical equivalent of the established workflow for single Gumbel plus the missing standard for the mixture.

What is manual today and what becomes automatic in v0.6.0

The whole script above is the v0.3.0 user experience for an extreme-value mixture: write the DSL formula, reparametrise to unbounded coordinates, set per-chain initial values that bracket the modes, pick method = "pt", post-process the labels.

The v0.6.0 release, the marginal auto-selection described in CATALOG_DESIGN.md, will collapse the workflow to a single call:

# Not available until v0.6.0
result <- gpum_fit_catalog(
  data    = y,
  catalog = c("gumbel", "gumbel_mix2"),
  ranking = "WAIC"
)
plot(result)

At that point the catalogue layer chooses the candidates from the data type, runs the unimodal Gumbel and the 2-component Gumbel mixture in parallel, ranks them by WAIC, handles the reparametrisation, enforces the mu_1 < mu_2 constraint, and opens a posterior predictive plot for the top model. Until then the workflow above is the honest manual path.

References

Coles, S. An Introduction to Statistical Modeling of Extreme Values. Springer, 2001.
Stephenson, A. G. evd: Extreme value distributions. R News, v. 2,
1. 2, p. 31-32, 2002.
Hartigan, J. A., e Hartigan, P. M. The dip test of unimodality. The Annals of Statistics, v. 13, n. 1, p. 70-84, 1985.
McLachlan, G., e Peel, D. Finite Mixture Models. Wiley, 2000.
Ray, S., e Lindsay, B. G. The topography of multivariate normal mixtures. The Annals of Statistics, v. 33, n. 5, p. 2042-2065, 2005.