---
title: "Bimodal extreme-value data: Gumbel mixture with parallel tempering"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Bimodal extreme-value data: Gumbel mixture with parallel tempering}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      fig.width = 7, fig.height = 4.2)
Sys.setenv(RAYON_NUM_THREADS = "2")
```

This vignette is the second chapter of the package's worked-case companion.
The first chapter, `m2_parallel_tempering`, used a synthetic separated
bimodal target to show why `method = "pt"` is needed. This chapter takes
the same machinery to a real-world question: when annual extremes come
from two distinct weather regimes, can the package recover both regimes
and decide whether a single Gumbel is enough?

The setting is the textbook one for univariate block maxima: each year
contributes its maximum, and the resulting series is modelled as a
Gumbel or a Generalised Extreme Value distribution (Coles 2001). When
the site sits at the edge of two climatic regimes the same series
carries two families of extremes, say summer-driven and winter-driven
maxima, and a single Gumbel is then the wrong model.

```{r libs}
library(gpumetropolis)
suppressMessages({
  library(evd)
  library(diptest)
})
```

## A real dataset: Oxford annual maximum temperatures

`evd::oxford` records the annual maximum daily temperature in degrees
Fahrenheit at Oxford, England, from 1901 to 1980 (`n = 80`). It is a
standard didactic dataset in `evd` (Stephenson 2002).

```{r oxford-show}
data(oxford)
y_ox <- as.numeric(oxford)
summary(y_ox)
diptest::dip.test(y_ox)
```

The dip test sits right at the conventional boundary, `p` near 0.05.
This is the "tie zone" described in
[`CATALOG_DESIGN.md`](https://github.com/pcbrom/gpumetropolis/blob/main/CATALOG_DESIGN.md):
the test does not clearly reject unimodality and does not comfortably
accept it. The catalogue policy is then to fit both a unimodal
candidate and a mixture challenger and compare by WAIC.

### A single Gumbel as the Tier A candidate

The Gumbel log-density up to an additive constant is
`-(y - mu)/beta - exp(-(y - mu)/beta) - log(beta)`; with the scale
reparametrised as `lb = log(beta)` the parameter is unbounded:

```{r oxford-single}
model_gumbel <- gpum_model(
  loglik = ~ -(y - mu) / exp(lb)
             - exp(-(y - mu) / exp(lb))
             - lb,
  params = c("mu", "lb"), data = "y"
)
fit_single <- gpu_metropolis(
  model_gumbel, data = list(y = y_ox),
  proposal_sd = c(0.5, 0.1),
  init = matrix(c(median(y_ox), log(IQR(y_ox) / 2)),
                nrow = 4, ncol = 2, byrow = TRUE),
  n_iter = 4000, n_chains = 4, seed = 1, backend = "cpu"
)
fit_single
```

The posterior summary recovers `mu` near 83 (degrees Fahrenheit) and
`beta = exp(lb)` near 4.2, the textbook fit. `gpum_diagnose(fit_single)`
opens the convergence panel.

```{r oxford-diagnose, fig.width = 8, fig.height = 6}
gpum_diagnose(fit_single)
```

### Why a Gumbel mixture on Oxford is harder than it looks

When the modes overlap heavily, as they do in Oxford, the mixture
posterior on `(p, mu_1, beta_1, mu_2, beta_2)` is weakly identified:
the data can be reasonably explained by a single Gumbel, so the
posterior has a wide ridge where one component dominates and the
other becomes a "ghost" whose parameters drift. Fitting a mixture on
Oxford without informative priors or hard order constraints produces a
fit whose nominal cold-chain summary is unstable. The catalogue layer
of v0.6.0 will handle this through default priors and an order
constraint; the next section shows the mixture recovery in a regime
where the data carry it cleanly.

## A two-regime simulation, the recovery story

A small simulation captures the structure of a station near a climatic
edge. Each year is in either a cool-extreme regime or a warm-extreme
regime with respective Gumbel parameters; the annual maximum is drawn
from the corresponding distribution.

```{r sim-data, fig.height = 3.5}
set.seed(2026)
rgumbel <- function(n, mu, beta) mu - beta * log(-log(runif(n)))
n_years <- 60
regime  <- rbinom(n_years, size = 1, prob = 0.65)
y <- ifelse(regime == 1,
            rgumbel(n_years, mu = 36, beta = 2.5),
            rgumbel(n_years, mu = 22, beta = 3.0))
hist(y, breaks = 25, freq = FALSE,
     main = "Simulated annual maxima at a climatic-edge station",
     xlab = "annual maximum (deg)", col = "grey85", border = "white")
```

The dip test does not reject unimodality on this sample, even though
by construction the data come from a mixture:

```{r sim-dip}
diptest::dip.test(y)$p.value
```

That is the case the catalogue worries about: the dip test gives the
"unimodal" verdict, yet the data come from a mixture. The decision
between a single Gumbel and a mixture is settled below by WAIC, the
predictive criterion the package computes from the pointwise
log-likelihood, once both models are fitted to this sample.

### The mixture model in the DSL

The 2-component Gumbel mixture log-density is `log(p * f_1(y) + (1 - p)
* f_2(y))`, the canonical log-sum-exp pattern that the DSL handles
through `log`, `exp`, `+`. The reparametrisation maps every parameter
to the real line so random-walk Metropolis can propose freely:

- `eta_p` in `R`, with `p = 1 / (1 + exp(-eta_p))`;
- `lb1`, `lb2` in `R`, with `beta_j = exp(lb_j)`;
- `mu1`, `mu2` in `R`, unconstrained.

```{r mix-model}
model_mix <- gpum_model(
  loglik = ~ log(
      (1 / (1 + exp(-eta_p))) *
        exp(-((y - mu1) / exp(lb1)) - exp(-((y - mu1) / exp(lb1)))) / exp(lb1)
    + (1 - 1 / (1 + exp(-eta_p))) *
        exp(-((y - mu2) / exp(lb2)) - exp(-((y - mu2) / exp(lb2)))) / exp(lb2)
  ),
  params = c("eta_p", "mu1", "lb1", "mu2", "lb2"),
  data   = "y"
)
```

### Parallel tempering on the mixture posterior

The posterior of a mixture has at least two equivalent labellings
(component 1 and component 2 can switch with no change in likelihood),
so the posterior is genuinely multimodal in the parameter space even
when the data marginal is unimodal. Parallel tempering is the right
sampler. The cold chain visits each labelling and the swap step keeps
the chain from sticking to one of them; a post-processing step then
fixes a canonical labelling.

```{r mix-fit}
qs <- quantile(y, c(0.2, 0.8))
init <- cbind(eta_p = rep(0, 8),
              mu1   = rep(qs[1], 8) + rnorm(8, 0, 1),
              lb1   = rep(log(2), 8),
              mu2   = rep(qs[2], 8) + rnorm(8, 0, 1),
              lb2   = rep(log(2), 8))
fit_mix <- gpu_metropolis(
  model_mix, data = list(y = y),
  proposal_sd = c(0.3, 0.4, 0.1, 0.4, 0.1),
  init = init, n_iter = 8000, n_chains = 8,
  method = "pt", seed = 11, backend = "cpu"
)
fit_mix
```

### Settling the choice by WAIC

The promise made above, that a mixture predicts this sample better than a
single Gumbel, is now computed rather than asserted. The single Gumbel is
fitted to the same simulated `y`, and `gpum_waic()` reads the predictive
accuracy of each from its pointwise log-likelihood. Both log-likelihoods are
normalised (the single Gumbel carries its `- lb` term, the mixture its
`/ exp(lb)` factors), so the comparison is valid; lower WAIC is the better
predictor.

```{r mix-waic}
fit_single_y <- gpu_metropolis(
  model_gumbel, data = list(y = y),
  proposal_sd = c(0.5, 0.1),
  init = matrix(c(median(y), log(IQR(y) / 2)), nrow = 4, ncol = 2,
                byrow = TRUE),
  n_iter = 4000, n_chains = 4, seed = 1, backend = "cpu"
)
waic_single <- gpum_waic(fit_single_y, data = list(y = y))
waic_mix    <- gpum_waic(fit_mix, data = list(y = y))
c(single = waic_single$waic, mixture = waic_mix$waic)
```

The mixture has the lower WAIC, the verdict the dip test could not give. The
WAIC is invariant to the label switching of the next section, so it is read
here directly from the cold-chain draws. `gpum_loo()` reports the same ranking
through Pareto-smoothed leave-one-out cross-validation, with the Pareto `k`
diagnostic flagging any observation whose estimate is unreliable.

```{r mix-loo}
if (requireNamespace("loo", quietly = TRUE)) gpum_loo(fit_mix, data = list(y = y))
```

### Identifiability post-processing

The mixture log-density is invariant under swapping component 1 and
component 2 (and simultaneously flipping `eta_p` to `-eta_p`). The
canonical labelling enforced here is `mu1 < mu2`: every cold-chain draw
in which `mu1 > mu2` has its component labels swapped.

```{r relabel}
draws <- fit_mix$draws[, 1L, , drop = TRUE]
swap_idx <- draws[, "mu1"] > draws[, "mu2"]
if (any(swap_idx)) {
  draws[swap_idx, ] <- draws[swap_idx, c("eta_p", "mu2", "lb2", "mu1", "lb1")]
  draws[swap_idx, "eta_p"] <- -draws[swap_idx, "eta_p"]
}
p     <- 1 / (1 + exp(-draws[, "eta_p"]))
mu_cool  <- draws[, "mu1"];  beta_cool <- exp(draws[, "lb1"])
mu_warm  <- draws[, "mu2"];  beta_warm <- exp(draws[, "lb2"])

post <- data.frame(
  parameter = c("p (cool)", "mu_cool", "beta_cool", "mu_warm", "beta_warm"),
  mean      = c(mean(p), mean(mu_cool), mean(beta_cool),
                mean(mu_warm), mean(beta_warm)),
  sd        = c(sd(p), sd(mu_cool), sd(beta_cool),
                sd(mu_warm), sd(beta_warm)),
  truth     = c(0.35, 22, 3.0, 36, 2.5)
)
print(post, row.names = FALSE)
```

Posterior means recover the four Gumbel parameters within a posterior
standard deviation; the mixing weight is slightly underestimated, the
usual signal that for `n = 60` the data identify the modes but not their
exact balance.

### Posterior predictive check

A direct visual check: generate new annual maxima from the fitted mixture and
compare their distribution against the observed data. `gpum_ppc()` draws the
posterior predictive sample, with the family's simulator passed in one line
(pick a component by its weight, then a Gumbel draw), and
`gpum_density_compare()` overlays the densities. Label switching does not
matter here, since the predictive distribution is the same in either labelling.

```{r ppc, fig.width = 7, fig.height = 4}
pp <- gpum_ppc(fit_mix, function(q) {
  if (stats::runif(1) < 1 / (1 + exp(-q["eta_p"]))) {
    q["mu1"] - exp(q["lb1"]) * log(-log(stats::runif(1)))
  } else {
    q["mu2"] - exp(q["lb2"]) * log(-log(stats::runif(1)))
  }
})
gpum_density_compare(y, list(mixture = pp),
                     main = "Posterior predictive vs observed",
                     xlab = "annual maximum (deg)")
```

The two densities overlap; the slight differences at the two peaks are
within the sampling variation of `n = 60` annual maxima.

## Fair comparison with current practice

A fair benchmark of `gpumetropolis` against existing R practice on
extreme-value mixtures asks two distinct questions: how does the single
Gumbel fit compare to the established `evd::fgev` route, and how does
the 2-component Gumbel mixture fit compare to what users would
otherwise reach for. The two answers are quite different and the
package's value proposition is in the second one.

### Single Gumbel: matched by an established MLE pipeline

`evd::fgev` is the standard one-liner for fitting a Gumbel (or full GEV)
by maximum likelihood. On the Oxford series:

```{r evd-single}
fit_evd <- evd::fgev(y_ox, shape = 0)  # shape = 0 -> Gumbel
fit_evd$estimate
sqrt(diag(fit_evd$var.cov))   # standard errors
```

The MLE point estimates of `loc` (`mu`) and `scale` (`beta`) agree with
the Bayesian posterior means of `fit_single` above to within the
posterior standard deviation. The MLE pipeline is faster (a single
optimisation) and is the right choice when the goal is just the point
estimate plus delta-method standard errors. The Bayesian path through
`gpumetropolis` adds: a full posterior, propagation of uncertainty into
predictive quantities (return levels, quantile credible bands), and the
portability of one kernel source across CPU, CUDA and Vulkan when the
log-likelihood is expensive over a large series.

Honest verdict on the single-component case: the established `evd`
pipeline is competitive in accuracy and faster in wall-clock time;
`gpumetropolis` is the right choice when the user already wants a full
Bayesian workflow or wants the same code path for downstream mixture
and copula inference. Neither dominates the other on the Tier A
unimodal Gumbel taken alone.

### Two-component Gumbel mixture: no standard one-liner in R

There is no R package today that fits a `k`-component Gumbel mixture
through a single function call. The closest options are:

| Path | What it covers | Effort |
|---|---|---|
| `evd`, `extRemes`, `ismev`, `eva` | single Gumbel / GEV / GP | low, mature |
| `evmix` | bulk-and-tail hybrid models (Normal bulk plus GP tail) | low for the supported families, not a `k`-component Gumbel mixture |
| `mclust`, `mixR`, `flexmix`, `mixtools` | finite mixtures, almost exclusively of Gaussians | low for Gaussian mixtures, requires custom distribution code for Gumbel |
| `evdbayes` | Bayesian inference for the single GEV via Metropolis | low for single GEV, requires manual extension for mixtures |
| Custom `rstan` / `cmdstanr` | any user-coded mixture, including Gumbel | medium to high; needs Stan code, posterior label-switching handling and warm-up tuning |
| Custom `nimble` | same as Stan with BUGS-style syntax | medium; nimble does not ship `dgumbel`, the user writes a custom distribution |

The honest comparison for the mixture case is therefore against
custom-written Stan or nimble code. `gpumetropolis` offers three things
in that comparison:

1. **One-formula declaration**. The model declaration above is the
   same R formula a user would already write for the log-likelihood;
   the package compiles it to bytecode and runs it on the chosen
   backend. The user does not write Stan or BUGS.
2. **Built-in parallel tempering**. The label-switching multimodality
   of a mixture posterior is the textbook case for tempering; in
   `gpumetropolis` it is `method = "pt"`. In a Stan or nimble workflow
   the user typically reaches for relabelling tricks, ordered
   constraints or hand-coded swap moves.
3. **Backend portability**. The same declaration runs on the CPU,
   CUDA and Vulkan. For an extreme-value series of length `n = 80` the
   GPU is not faster than the CPU; the portability matters when the
   workflow scales to longer series or to many parallel chains.

A fair speed comparison is harder to state in one number because the
mixture pipeline is not standardised. The closest informal benchmark:
running the simulation fit above with `nimble` requires (i) a custom
`dgumbel` distribution, (ii) a careful order constraint on the means,
(iii) a longer warm-up to manage label-switching without tempering.
The user who is willing to write that code gets results comparable to
the `gpumetropolis` mixture in wall-clock; the user who wants a
one-formula path benefits from this package.

The catalogue layer of v0.6.0 (next section) is the design that closes
the remaining gap: `gpum_fit_catalog(y, catalog = c("gumbel", "gumbel_mix2"))`
will rank the two candidates by WAIC and pick the better one, the
practical equivalent of the established workflow for single Gumbel plus
the missing standard for the mixture.

## What is manual today and what becomes automatic in v0.6.0

The whole script above is the v0.3.0 user experience for an
extreme-value mixture: write the DSL formula, reparametrise to
unbounded coordinates, set per-chain initial values that bracket the
modes, pick `method = "pt"`, post-process the labels.

The v0.6.0 release, the marginal auto-selection described in
[`CATALOG_DESIGN.md`](https://github.com/pcbrom/gpumetropolis/blob/main/CATALOG_DESIGN.md),
will collapse the workflow to a single call:

```{r auto-future, eval = FALSE}
# Not available until v0.6.0
result <- gpum_fit_catalog(
  data    = y,
  catalog = c("gumbel", "gumbel_mix2"),
  ranking = "WAIC"
)
plot(result)
```

At that point the catalogue layer chooses the candidates from the data
type, runs the unimodal Gumbel and the 2-component Gumbel mixture in
parallel, ranks them by WAIC, handles the reparametrisation, enforces
the `mu_1 < mu_2` constraint, and opens a posterior predictive plot for
the top model. Until then the workflow above is the honest manual path.

## References

- Coles, S. *An Introduction to Statistical Modeling of Extreme Values*.
  Springer, 2001.
- Stephenson, A. G. evd: Extreme value distributions. *R News*, v. 2,
  n. 2, p. 31-32, 2002.
- Hartigan, J. A., e Hartigan, P. M. The dip test of unimodality.
  *The Annals of Statistics*, v. 13, n. 1, p. 70-84, 1985.
- McLachlan, G., e Peel, D. *Finite Mixture Models*. Wiley, 2000.
- Ray, S., e Lindsay, B. G. The topography of multivariate normal
  mixtures. *The Annals of Statistics*, v. 33, n. 5, p. 2042-2065, 2005.