**Note on notation:** I tried to be consistent and use plain symbols (\(y_1, z, ...\))
for variables, bold symbols (\(\mathbf{y}\)) for vectors and matrices,
\(P(A)\) for the probability of event \(A\), and \(p(y)\) for the density of a random variable \(y\).

To make things specific, we will use very simple models as examples (the same as in the Three ways post). Our first model, \(\mathcal{M}_1\), assumes that the \(K\) data points are independent draws from a standard normal distribution, i.e.:

\[ \mathcal{M}_1 : \mathbf{y} = \{y_1, ... , y_K\} \\ y_i \sim N(0,1) \]

Our second model, \(\mathcal{M}_2\), assumes that the mean of the normal distribution is a free parameter with a normal prior, i.e.:

\[ \mathcal{M}_2: \mathbf{y} = \{y_1, ... , y_K\} \\ y_i \sim N(\alpha, 1) \\ \alpha \sim N(0,2) \]

Now, let's take a simple vector of values to evaluate the models against:

`y <- c(0.5, 0.7, -0.4, 0.1)`

One way to define the Bayes factor is as the ratio of *evidence*, i.e.:

\[ BF_{12} = \frac{P(\mathbf{y} | \mathcal{M}_1)}{P(\mathbf{y} | \mathcal{M}_2)} \]

where “evidence” is exactly the prior predictive density of the data, i.e. the density of the data after integrating out all the parameters.

Our models are simple enough that we can evaluate the evidence analytically — the Three ways post has the math.

For the given dataset we thus obtain \(\log P(\mathbf{y} | \mathcal{M}_1) \simeq -4.131\), \(\log P(\mathbf{y} | \mathcal{M}_2) \simeq -5.452\) and \(BF_{12} \simeq 3.748\).
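These values can be reproduced numerically — here is a minimal sketch, assuming the `mvtnorm` package is installed (under \(\mathcal{M}_2\), integrating out \(\alpha\) leaves a multivariate normal with covariance \(I + 2^2 \mathbf{1}\mathbf{1}^T\)):

```
# Evidence for M1: data are i.i.d. N(0, 1), so the density factorizes
log_evidence_m1 <- sum(dnorm(y, mean = 0, sd = 1, log = TRUE))

# Evidence for M2: integrating out alpha ~ N(0, 2) leaves a multivariate
# normal with covariance I + 2^2 * J (J = matrix of ones)
K <- length(y)
Sigma_m2 <- diag(K) + 2^2 * matrix(1, nrow = K, ncol = K)
log_evidence_m2 <- mvtnorm::dmvnorm(y, sigma = Sigma_m2, log = TRUE)

c(log_evidence_m1, log_evidence_m2, exp(log_evidence_m1 - log_evidence_m2))
# approximately -4.131, -5.452 and 3.748
```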

One of the main results of Fong & Holmes 2020
is that evidence is related to cross-validation (Proposition 2 in the paper)^{1}.
For this to hold, we need to score the cross-validation using the log posterior predictive density:

\[ s_\mathcal{M}(\tilde{y} \mid \mathbf{y}) = \log \int f_\mathcal{M}(\tilde{y}, \theta) \,{\rm d} p(\theta \mid \mathbf{y}, \mathcal{M}) = \log \int_\Theta f_\mathcal{M}(\tilde{y}, \theta) p(\theta \mid \mathbf{y}, \mathcal{M}) \,{\rm d} \theta \]

where \(\mathbf{\theta} \in \Theta\) is the vector of all parameters of the model and \(f_\mathcal{M}(y, \theta)\) is the likelihood of model \(\mathcal{M}\) evaluated for data point \(y\) and parameters \(\theta\). Note that this is the same score as used e.g. in the `loo` package for Bayesian cross-validation.

We can then define an exhaustive leave-\(J\)-out cross-validation of model \(\mathcal{M}\) with data \(\mathbf{y} = (y_1, ... , y_K)\) as the average of the log predictive densities over all possible held-out datasets of size \(J\):

\[ S^{\rm CV}_\mathcal{M} (\mathbf{y} ; J) = \frac{1}{{K \choose J}} \sum_{t=1}^{{K \choose J }} \frac{1}{J} \sum_{j=1}^{J} s\bigl(\tilde{y}_{j}^{(t)} \;\big|\; y^{(t)}_{1:K-J}\bigr) \]

where \(\tilde{y}^{(t)}_j\) is the \(j\)-th element of the \(t\)-th combination of \(J\) elements out of \(K\) (the held-out set) and \(y^{(t)}_i\) is the \(i\)-th element of the complement of this combination (the observed set). Finally, we express the logarithm of the evidence as the sum of the cross-validation scores over all possible held-out dataset sizes:

\[ \log P(\mathbf{y} | \mathcal{M}) = \sum_{J=1}^{K} S^{\rm CV}_\mathcal{M} (\mathbf{y} ; J) \]

Note that the formula above holds regardless of the specific way we choose to partition \(\mathbf{y}\) into individual “data points”. At one extreme, we can treat all the data as a single indivisible element — we then have \(K = 1\) and recover the formula for evidence as the prior predictive probability. We can partition by individual numerical values, but we can also partition by e.g. patients.

In all cases, we take the *joint* likelihood \(f_\mathcal{M}(\tilde{y}, \theta)\) to compute the expected log predictive density for each element of the partition. But for each cross-validation fold we then take the *average* of those densities. So a finer partition will do “more averaging” and treat small subsets of data as independent, while a coarser partition will consider the joint dependencies in each element of the partition, and then do “less averaging”.

Finally, the above formula is in most cases ridiculously impractical for actual computation and is therefore primarily of theoretical interest.

Let’s explore how the formula works in code, starting with the \(\mathcal{M_2}\) model (intercept) as that’s more interesting. We will closely follow the formulae. Note that the posterior density \(p(\alpha | \mathbf{y})\) is available analytically and is normal (see the wiki page for derivation).

Since the posterior is normal and the observation model is normal, the posterior predictive density is also normal. The posterior predictive mean is exactly the posterior mean and posterior predictive variance is equal to the sum of observational and posterior variances.
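Concretely, with prior \(N(\mu_0, \sigma_0^2)\) (here \(\mu_0 = 0, \sigma_0 = 2\)), observation sd \(\sigma = 1\) and \(K\) observed values with mean \(\bar{y}\), the standard conjugate update gives:

\[ \frac{1}{\sigma_{\rm post}^2} = \frac{1}{\sigma_0^2} + \frac{K}{\sigma^2}, \qquad \mu_{\rm post} = \sigma_{\rm post}^2 \left( \frac{\mu_0}{\sigma_0^2} + \frac{K \bar{y}}{\sigma^2} \right), \qquad \tilde{y} \mid \mathbf{y} \sim N\left(\mu_{\rm post}, \sigma^2 + \sigma_{\rm post}^2\right) \]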

Putting it all together, we compute \(\frac{1}{J} \sum_{j=1}^{J} s\bigl(\tilde{y}_{j}^{(t)} \;\big|\; y^{(t)}_{1:K-J}\bigr)\) in the `cv_score_m2_single` function:

```
cv_score_m2_single <- function(observed, held_out) {
  prior_mean <- 0
  prior_sd <- 2
  obs_sd <- 1
  K <- length(observed)
  if(K > 0) {
    prior_precision <- prior_sd ^ -2
    obs_precision <- obs_sd ^ -2
    obs_mean <- mean(observed)
    post_precision <- prior_precision + K * obs_precision
    post_sd <- sqrt(1 / post_precision)
    post_mean <- (K * obs_precision * obs_mean + prior_precision * prior_mean) /
      post_precision
  } else {
    post_mean <- prior_mean
    post_sd <- prior_sd
  }
  posterior_pred_sd <- sqrt(post_sd^2 + obs_sd^2)
  log_score <- sum(dnorm(
    held_out, mean = post_mean, sd = posterior_pred_sd, log = TRUE))
  return(log_score / length(held_out))
}
```

Now `cv_score_m2` loops over all possible combinations of size \(J\) and `log_evidence_m2_cv` adds it all together:

```
cv_score_m2 <- function(y, J) {
  K <- length(y)
  combinations <- combn(1:K, J)
  res_unscaled <- 0
  for(t in 1:ncol(combinations)) {
    held_out <- y[combinations[, t]]
    observed <- y[setdiff(1:K, combinations[, t])]
    res_unscaled <- res_unscaled + cv_score_m2_single(observed, held_out)
  }
  return(res_unscaled / ncol(combinations))
}

log_evidence_m2_cv <- function(y) {
  res <- 0
  for(p in 1:length(y)) {
    res <- res + cv_score_m2(y, p)
  }
  return(res)
}
```

We obtain a result that is identical to the direct computation of evidence:

`log_evidence_m2_cv(y)`

`## [1] -5.452067`

For the \(\mathcal{M_1}\) (null) model, we can avoid all this looping because the density of the held-out data does not depend on the data seen so far, so we have

\[ s_1(\tilde{y} \mid \mathbf{y}) = \mathtt{normal\_lpdf}(\tilde{y} | 0, 1) \]

where \(\mathtt{normal\_lpdf}\) is the log of the density function of a normal distribution. Since the cross-validation is exhaustive, each \(y\) value is held out the same number of times, and since \(S^{\rm{CV}}_\mathcal{M}\) is an average, we have

\[ S^{\rm CV}_\mathcal{M} (\mathbf{y} ; J) = \frac{1}{K}\sum_{i = 1}^K \mathtt{normal\_lpdf}(y_i | 0, 1) \] and the evidence thus is:

\[ \log P(\mathbf{y} | \mathcal{M}) = \sum_{i = 1}^K \mathtt{normal\_lpdf}(y_i | 0, 1) \]

which happens to be exactly the same as the log prior predictive density and matches our expectations:

`sum(dnorm(y, mean = 0, sd = 1, log = TRUE))`

`## [1] -4.130754`

Since we obtained the correct values for evidence, we also obtain the correct value for the Bayes factor.
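As a quick sanity check, we can recover the Bayes factor from the two cross-validation-based log evidences (reusing `y` and `log_evidence_m2_cv` from above):

```
log_ev_m1 <- sum(dnorm(y, mean = 0, sd = 1, log = TRUE))
log_ev_m2 <- log_evidence_m2_cv(y)
exp(log_ev_m1 - log_ev_m2)
# approximately 3.748, matching BF_12
```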

And interestingly, the log Bayes factor is the difference of log evidences, so it is in this sense an analog to the difference in log predictive density as reported for cross-validation by e.g. the `loo` package.

Some people claim that this connection is a justification for using Bayes factors. Some even claim that if you accept cross-validation as valid you *must* accept Bayes factors as valid. I am personally not very convinced — as already mentioned by Fong & Holmes 2020, the cross-validation scheme we see here is pretty weird. Why would I want to include “leave-all-data-out” or “leave-almost-all-data-out” in my cross-validation?

I also agree with Aki Vehtari’s cross-validation FAQ (which is great overall), that the cross-validation scheme you use should be chosen with an eye toward the predictive task you want to handle. If you have a hierarchical model and you expect to never see new groups (e.g. groups are states), leaving out a single observation can make sense. If on the other hand predicting for new groups is essential (e.g. groups are patients), leaving out whole groups is much more reasonable. There’s no such flexibility in Bayes factors.

You say you don’t care about predictions? Well, I subscribe to the view that everything is prediction — i.e. every inference task can be reframed as an equivalent prediction task. Do you want to select a model? Do you want to know the difference between groups? You are implicitly making predictions about future datasets. So I’d suggest you find the prediction task corresponding to your inference goals. Performing well in this task will lead to good inferences and this performance can be often well approximated with cross-validation.

There are also practical considerations: as discussed in the Three ways post, Bayes factors are hard to compute, depend heavily on the choice of priors and are hard to interpret. To be fair, cross-validation can be shown to have some issues as the size of the dataset grows to infinity: you need to increase the proportion of held-out data as the data size increases to avoid those (see Gronau & Wagenmakers 2019 and the response in Vehtari et al. 2019 for more on this). But I don’t work with datasets that are effectively infinite…

This does not mean that I believe Bayes factors are *never* useful. I can still imagine scenarios where they may have some merit — if you have two well-specified substantive models with tightly constrained priors (e.g. from previous measurements), you have tons of data and you check that you can compute Bayes factors accurately, then they might provide value. I just think very few people are dealing with such situations.

There are also other ways to relate cross-validation to Bayes factors as discussed e.g. in section 6.1.6 of Bernardo & Smith (1994), but those are not the focus of this post.↩︎

`brms` is a great package. It allows you to put predictors on a lot of things. Its power is however not absolute — one thing it doesn’t let you directly do is use data to predict variances of random/varying effects. Here we will show pretty general techniques to hack with `brms` that let us achieve exactly this goal (and many more).

To be precise, you can use the construct `(1|gr(patient, by = trt))`, which fits a separate standard deviation for each level of `trt`, and which is almost the same as using `trt` as a categorical predictor for the standard deviation. You however cannot go further and use any other type of predictor here. E.g. the following model is impossible in plain `brms`:

\[ y_i \sim N\left(\mu_i, \sigma \right) \\ \mu_i = \alpha + \beta x_i + \gamma_{\text{patient}(i)} \\ \gamma_{p} \sim N \left(0, \tau_{\text{treatment}(p)}\right) \\ \tau_t = \alpha^\prime + \beta^\prime x^\prime_t \]

Where \(x\) is a vector of observation-level predictors while \(x^\prime\) is a vector of treatment-level predictors. In between we have patients — each contributing a bunch of observations and the standard deviation of the patient-level random intercepts depends on our treatment-level predictors.

**UPDATE:** *Shortly after publishing this, Ven Popov noted on Stan forums that this type of model is achievable with non-linear formulas and without any extra hacks. So I’ll add the non-linear formula approach below and keep the hacky approach as a lesson in how to play with `brms`.*

Well, it is not completely impossible. Since `brms` is *immensely* hackable, you can actually make this work. This blogpost will discuss how to do this. This does not mean it is a good idea or that you *should* do it. I am just showing that it is possible and hopefully also showing some general ways to hack with `brms`.

Also, this type of model is likely to be a bit data-hungry — you need to have enough observations per treatment and enough treatments to be able to estimate \(\tau\) well enough to learn about its predictors.

Let’s set up and get our hands dirty.

```
library(cmdstanr)
library(brms)
library(tidyverse)
library(knitr)
library(bayesplot)
ggplot2::theme_set(cowplot::theme_cowplot())
options(mc.cores = parallel::detectCores(), brms.backend = "cmdstanr")

cache_dir <- "_brms_ranef_cache"
if(!dir.exists(cache_dir)) {
  dir.create(cache_dir)
}
```

Note that the way we have set up the model implies that patients are nested within treatments (i.e. each patient only ever gets one treatment). Since each random effect can only have one prior distribution, this is the easiest way to make sense of the model.

First, we set up the treatment-level predictors in a treatment-level data frame and use those to predict the sds (\(\tau\) above).

```
set.seed(354855)
N <- 500
N_pts <- floor(N / 5)
N_trts <- 10
trt_intercept <- 0
trt_x_b <- 1
trt_data <- data.frame(trt_x = rnorm(N_trts))
# Corresponds to tau in the mathematical model
trt_sd <- exp(trt_intercept + trt_x_b * trt_data$trt_x)
```

Now, we set up the patient-level random effects, with varying sds (corresponding to \(\gamma\) above).

```
patient_treatment <- sample(1:N_trts, size = N_pts, replace = TRUE)
ranef <- rnorm(N_pts, mean = 0, sd = trt_sd[patient_treatment])
```

Finally, we set up the main data frame with multiple observations of each patient.

```
intercept <- 1
x_b <- 0.5
obs_sigma <- 1

base_data <- data.frame(x = rnorm(N),
                        patient_id = rep(1:N_pts, length.out = N))
base_data$trt_id <- patient_treatment[base_data$patient_id]
base_data_predictor <- intercept + x_b * base_data$x + ranef[base_data$patient_id]
base_data$y <- rnorm(N, mean = base_data_predictor, sd = obs_sigma)
```
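A quick sanity check that the simulated data respect the nesting assumption (each patient is associated with exactly one treatment):

```
# Every patient_id should map to a single trt_id
stopifnot(all(tapply(base_data$trt_id, base_data$patient_id,
                     function(t) length(unique(t))) == 1))
```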

As noted by Ven Popov, we can use non-linear `brms` formulas for this task. First, we extend the patient-level data with treatment-level data.

```
trt_with_id <- trt_data %>% mutate(trt_id = 1:n())
data_joined <- base_data %>% inner_join(trt_with_id, by = "trt_id")
```
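A quick check that the join neither dropped nor duplicated any observations (an inner join would silently drop rows with an unmatched `trt_id`):

```
# Every observation row should survive the join exactly once
stopifnot(nrow(data_joined) == nrow(base_data))
```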

Now, we create a patient-level random intercept, but fix its standard deviation to \(1\). We then create a linear predictor for the log of the random-effect standard deviation and multiply the “standardized” random intercept by the exp-transformed value of this predictor, giving us a random intercept with the correct standard deviation.

This is what the `brms` code looks like:

```
# Fix the sd to 1
prior_nl <- prior(constant(1), class = 'sd', nlpar = 'patientintercept')

fit_nl <- brm(
  # combine the main predictor with the random effect in a non-linear formula
  bf(y ~ muy + patientintercept * exp(logmysigma),
     # main linear predictor for y (additional predictors go here)
     muy ~ x,
     # specify the random intercept
     patientintercept ~ 0 + (1|patient_id),
     # linear predictor for log random effect sd
     # (additional predictors for sd go here)
     logmysigma ~ trt_x,
     nl = TRUE),
  prior = prior_nl,
  data = data_joined,
  file = file.path(cache_dir, "fit_nl.rds"),
  file_refit = "on_change"
)
```

We get a decent recovery of the parameters — recall that we simulated data with `muy_Intercept` = 1, `logmysigma_Intercept` = 0, `muy_x` = 0.5, `logmysigma_trt_x` = 1 and `sigma` = 1.

`summary(fit_nl)`

```
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: y ~ muy + patientintercept * exp(logmysigma)
## muy ~ x
## patientintercept ~ 0 + (1 | patient_id)
## logmysigma ~ trt_x
## Data: data_joined (Number of observations: 500)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~patient_id (Number of levels: 100)
## Estimate Est.Error l-95% CI u-95% CI Rhat
## sd(patientintercept_Intercept) 1.00 0.00 1.00 1.00 NA
## Bulk_ESS Tail_ESS
## sd(patientintercept_Intercept) NA NA
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## muy_Intercept 0.99 0.09 0.80 1.17 1.00 1066
## muy_x 0.54 0.05 0.45 0.64 1.00 5005
## logmysigma_Intercept 0.17 0.12 -0.07 0.39 1.01 678
## logmysigma_trt_x 0.92 0.10 0.74 1.13 1.01 401
## Tail_ESS
## muy_Intercept 1969
## muy_x 3006
## logmysigma_Intercept 1188
## logmysigma_trt_x 829
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.96 0.03 0.90 1.03 1.00 3333 2669
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

To keep the lessons for the future, I am also including a more hacky approach that in principle lets you do much more, but is a bit of an overkill here. The main downside of my approach is that it forces you to completely override the likelihood and that you have to build the random effect with predicted sigma in hand-written Stan code. This may mean the benefits of `brms` are now too small and you might be better off building the whole thing directly in Stan.

The first problem to solve is that at its core, `brms` requires us to use a single data frame as input. But we have a treatment-level data frame and then an observation-level data frame. We get around this by adding dummy values so that both data frames have the same columns, binding them together and then using the `subset` addition term to use a different formula for each. We will also need a dummy outcome variable for the treatment-level data.

```
combined_data <- rbind(
  base_data %>% mutate(
    is_trt = FALSE,
    trt_x = 0,
    trt_y = 0
  ),
  trt_data %>% mutate(
    is_trt = TRUE,
    trt_id = 0,
    patient_id = 0,
    y = 0,
    x = 0,
    trt_y = 0
  )
)
```
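A small check that the stacking worked as intended — the combined data should have one row per observation plus one row per treatment:

```
# 500 observation-level rows + 10 treatment-level rows
nrow(combined_data)  # 510
table(combined_data$is_trt)
```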

The main idea for the implementation is that we completely overtake the machinery of `brms` after the linear predictors are constructed. To do that, we create a custom family that is empty (i.e. adds nothing to the log likelihood) and use it in our formula.

```
# Build the empty families --- one has just a single parameter and will be used
# for treatment-level sds. The other has mu and sigma parameters and will be
# used for the observation model.
empty_for_trt <- custom_family("empty_for_trt", type = "real")
empty_for_obs <- custom_family("empty_for_obs", dpars = c("mu", "sigma"),
                               links = c("identity", "log"), type = "real",
                               lb = c(NA, 0))

empty_func_stanvar <- stanvar(block = "functions", scode = "
  real empty_for_trt_lpdf(real y, real mu) {
    return 0;
  }

  real empty_for_obs_lpdf(real y, real mu, real sigma) {
    return 0;
  }
")
```

We then take the linear predictions for the sd of the random effects (\(\tau\)) and use it to manually build our random effect values (with non-centered parametrization). We manually add those values to the rest of the linear predictor term and then manually add our desired likelihood.
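In equation form, the non-centered parametrization of the random effects is:

\[ \gamma^{\rm raw}_p \sim N(0, 1), \qquad \gamma_p = \tau_{\text{treatment}(p)} \, \gamma^{\rm raw}_p \]

which gives \(\gamma_p \sim N\left(0, \tau_{\text{treatment}(p)}\right)\) as required by the model.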

Our final formula then looks like this:

```
f <- mvbrmsformula(
  brmsformula(y | subset(!is_trt) ~ x, family = empty_for_obs),
  brmsformula(trt_y | subset(is_trt) ~ trt_x, family = empty_for_trt),
  rescor = FALSE)
```

In this setup, `brms` will build a bunch of variables for both formulas that we can access in our Stan code. Their names will depend on the names of the outcome variables — since our main outcome is `y`, the relevant variables will be `N_y` (number of rows for this outcome), `mu_y` and `sigma_y` (distributional parameters for this outcome).

Our dummy outcome is `trt_y` and the relevant variables will be `N_trty` and `mu_trty`, because `brms` removes underscores. You can always use `make_stancode` and `make_standata` to see how `brms` transforms names and input data.

For all this to happen, we also need to pass a bunch of extra data via `stanvars`.

Let us prepare the extra Stan code.

```
# Pass the extra data. We'll take advantage of some already existing data
# variables defined by brms, these include:
# N_y - the number of observation-level data
# N_trty - the number of treatment-level data (and thus the number of treatments)
# we however need to pass the rest of the data for the random effect
data_stanvars <-
  stanvar(x = N_pts, block = "data", scode = "int<lower=2> N_pts;") +
  stanvar(x = patient_treatment, name = "trt_id", block = "data",
          scode = "array[N_pts] int<lower=1, upper=N_trty> trt_id;") +
  stanvar(x = base_data$patient_id, name = "patient_id", block = "data",
          scode = "array[N_y] int<lower=1, upper=N_pts> patient_id;")

# Raw parameters for the random effects
parameter_stanvar <-
  stanvar(block = "parameters", scode = "
    vector[N_pts] my_ranef_raw;
  ")

# Prior - we are using the non-centered parametrization, so it is just N(0,1)
# and we multiply by the sd later.
# Note that current versions of brms compute log-prior in the transformed
# parameters block, so we do it as well.
prior_stanvar <-
  stanvar(block = "tparameters",
          scode = "lprior += std_normal_lpdf(to_vector(my_ranef_raw));")

# Here is where we add the random effect to the existing predictor values and
# reconstruct the likelihood.
# Once again using the values generated by brms for the predictors in mu_trty,
# mu_y, sigma_y.
likelihood_stanvar <-
  stanvar(block = "likelihood", position = "end", scode = "
    // New scope to let us introduce new parameters
    {
      vector[N_trty] trt_sds = exp(mu_trty);
      vector[N_pts] my_ranef = my_ranef_raw .* trt_sds[trt_id];
      for(n in 1:N_y) {
        // Add the needed ranef
        real mu_with_ranef = mu_y[n] + my_ranef[patient_id[n]];
        // reimplement the likelihood
        target += normal_lpdf(Y_y[n] | mu_with_ranef, sigma_y);
      }
    }
  ")

predict_ranef_stanvars <- empty_func_stanvar +
  data_stanvars +
  parameter_stanvar +
  prior_stanvar +
  likelihood_stanvar
```

This is the complete Stan code generated by `brms` with our additions:

```
make_stancode(f, data = combined_data,
stanvars = predict_ranef_stanvars)
```

```
## // generated with brms 2.20.4
## functions {
## real empty_for_trt_lpdf(real y, real mu) {
## return 0;
## }
##
## real empty_for_obs_lpdf(real y, real mu, real sigma) {
## return 0;
## }
## }
## data {
## int<lower=1> N; // total number of observations
## int<lower=1> N_y; // number of observations
## vector[N_y] Y_y; // response variable
## int<lower=1> K_y; // number of population-level effects
## matrix[N_y, K_y] X_y; // population-level design matrix
## int<lower=1> Kc_y; // number of population-level effects after centering
## int<lower=1> N_trty; // number of observations
## vector[N_trty] Y_trty; // response variable
## int<lower=1> K_trty; // number of population-level effects
## matrix[N_trty, K_trty] X_trty; // population-level design matrix
## int<lower=1> Kc_trty; // number of population-level effects after centering
## int prior_only; // should the likelihood be ignored?
## int<lower=2> N_pts;
## array[N_pts] int<lower=1, upper=N_trty> trt_id;
## array[N_y] int<lower=1, upper=N_pts> patient_id;
## }
## transformed data {
## matrix[N_y, Kc_y] Xc_y; // centered version of X_y without an intercept
## vector[Kc_y] means_X_y; // column means of X_y before centering
## matrix[N_trty, Kc_trty] Xc_trty; // centered version of X_trty without an intercept
## vector[Kc_trty] means_X_trty; // column means of X_trty before centering
## for (i in 2 : K_y) {
## means_X_y[i - 1] = mean(X_y[ : , i]);
## Xc_y[ : , i - 1] = X_y[ : , i] - means_X_y[i - 1];
## }
## for (i in 2 : K_trty) {
## means_X_trty[i - 1] = mean(X_trty[ : , i]);
## Xc_trty[ : , i - 1] = X_trty[ : , i] - means_X_trty[i - 1];
## }
## }
## parameters {
## vector[Kc_y] b_y; // regression coefficients
## real Intercept_y; // temporary intercept for centered predictors
## real<lower=0> sigma_y; // dispersion parameter
## vector[Kc_trty] b_trty; // regression coefficients
## real Intercept_trty; // temporary intercept for centered predictors
##
## vector[N_pts] my_ranef_raw;
## }
## transformed parameters {
## real lprior = 0; // prior contributions to the log posterior
## lprior += std_normal_lpdf(to_vector(my_ranef_raw));
## lprior += student_t_lpdf(Intercept_y | 3, 0.9, 2.5);
## lprior += student_t_lpdf(sigma_y | 3, 0, 2.5)
## - 1 * student_t_lccdf(0 | 3, 0, 2.5);
## lprior += student_t_lpdf(Intercept_trty | 3, 0, 2.5);
## }
## model {
## // likelihood including constants
## if (!prior_only) {
## // initialize linear predictor term
## vector[N_y] mu_y = rep_vector(0.0, N_y);
## // initialize linear predictor term
## vector[N_trty] mu_trty = rep_vector(0.0, N_trty);
## mu_y += Intercept_y + Xc_y * b_y;
## mu_trty += Intercept_trty + Xc_trty * b_trty;
## for (n in 1 : N_y) {
## target += empty_for_obs_lpdf(Y_y[n] | mu_y[n], sigma_y);
## }
## for (n in 1 : N_trty) {
## target += empty_for_trt_lpdf(Y_trty[n] | mu_trty[n]);
## }
##
## // New scope to let us introduce new parameters
## {
## vector[N_trty] trt_sds = exp(mu_trty);
## vector[N_pts] my_ranef = my_ranef_raw .* trt_sds[trt_id];
## for (n in 1 : N_y) {
## // Add the needed ranef
## real mu_with_ranef = mu_y[n] + my_ranef[patient_id[n]];
## // reimplement the likelihood
## target += normal_lpdf(Y_y[n] | mu_with_ranef, sigma_y);
## }
## }
## }
## // priors including constants
## target += lprior;
## }
## generated quantities {
## // actual population-level intercept
## real b_y_Intercept = Intercept_y - dot_product(means_X_y, b_y);
## // actual population-level intercept
## real b_trty_Intercept = Intercept_trty - dot_product(means_X_trty, b_trty);
## }
##
```

Now, we can compile and fit the model:

```
fit <- brm(
  f,
  data = combined_data,
  stanvars = predict_ranef_stanvars,
  file = file.path(cache_dir, "fit.rds"),
  file_refit = "on_change")
```

We get a decent recovery of the parameters — recall that we simulated data with `y_Intercept` = 1, `trty_Intercept` = 0, `y_x` = 0.5, `trty_trt_x` = 1 and `sigma_y` = 1.

`summary(fit)`

```
## Family: MV(empty_for_obs, empty_for_trt)
## Links: mu = identity; sigma = identity
## mu = identity
## Formula: y | subset(!is_trt) ~ x
## trt_y | subset(is_trt) ~ trt_x
## Data: combined_data (Number of observations: 510)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## y_Intercept 0.99 0.09 0.80 1.17 1.01 1009 1762
## trty_Intercept 0.16 0.12 -0.07 0.38 1.00 683 1556
## y_x 0.54 0.05 0.45 0.64 1.00 5382 3364
## trty_trt_x 0.92 0.10 0.73 1.13 1.01 329 846
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_y 0.96 0.03 0.90 1.03 1.00 4061 3420
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Unfortunately, the code above is somewhat fragile. Notably, if we add a predictor for the standard deviation of the observations, then `sigma_y` in the Stan code won’t be a scalar, but a vector, and we’ll need to adjust the Stan code a little bit.
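For illustration, a sketch of the adjusted line inside our likelihood `stanvar` (assuming `brms` still names the vectorized parameter `sigma_y`):

```
// sigma_y would now be vector[N_y] rather than real, so index it per observation:
target += normal_lpdf(Y_y[n] | mu_with_ranef, sigma_y[n]);
```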

Since we overtook so much of the `brms` machinery, things like `posterior_predict()`, `posterior_epred()` and `log_lik()` won’t work out of the box and we need a little extra work to get them, mirroring the extra steps we did in the Stan code.

Luckily for us, `brms` exposes the `prepare_predictions()` and `get_dpar()` functions, which do most of the heavy lifting. Let’s start with mimicking `posterior_epred()`:

```
pred_trt <- prepare_predictions(fit, resp = "trty")
# A matrix of 4000 draws per 10 treatments
samples_trt_mu <- brms::get_dpar(pred_trt, "mu")

pred_y <- prepare_predictions(fit, resp = "y")
# A matrix of 4000 draws per 500 observations
samples_mu <- brms::get_dpar(pred_y, "mu")

# the ranef samples need to be taken directly from the Stan fit
# A matrix of 4000 draws per 100 patients
samples_ranef_raw <- posterior::as_draws_matrix(fit$fit) %>%
  posterior::subset_draws(variable = "my_ranef_raw")

samples_sigma_per_patient <- exp(samples_trt_mu)[, patient_treatment]
samples_ranef <- samples_ranef_raw * samples_sigma_per_patient
samples_ranef_per_obs <- samples_ranef[, base_data$patient_id]
samples_epred <- samples_mu + samples_ranef_per_obs
```

And once we have the predictions for `mu`, we can combine them with samples for `sigma` to get predictions including the observation noise and continue to do a posterior predictive check (which looks good).

```
# A vector of 4000 draws
samples_sigma <- brms::get_dpar(pred_y, "sigma")

pred_y <- matrix(nrow = nrow(samples_epred), ncol = ncol(samples_epred))
for(j in 1:ncol(samples_epred)) {
  pred_y[, j] <- rnorm(nrow(samples_epred),
                       mean = samples_epred[, j],
                       sd = samples_sigma)
}

bayesplot::ppc_dens_overlay(base_data$y, pred_y[sample.int(4000, size = 30), ])
```
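In the same spirit, we could sketch the per-observation log-likelihood matrix that `log_lik()` would normally return, reusing the draws computed above:

```
# Hypothetical sketch: log-likelihood of each observation under each draw,
# mirroring what log_lik() would return for this model
samples_loglik <- matrix(nrow = nrow(samples_epred), ncol = ncol(samples_epred))
for(j in 1:ncol(samples_epred)) {
  samples_loglik[, j] <- dnorm(base_data$y[j],
                               mean = samples_epred[, j],
                               sd = samples_sigma, log = TRUE)
}
```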

So yay, we can use `brms` for the core of our model and then extend it to cover predictors for the standard deviation of random effects. Unfortunately, it requires quite a bit of extra work. Using this heavy machinery for such a simple model as in this quick example is probably overkill and you would be better off just implementing the whole thing in Stan. But if your current `brms` model is quite complex and the only extra thing you need are the sd predictors, then the cost-benefit considerations might be quite different.

The techniques we used to hack around `brms` are also very general — note that we have shown how to:

- Combine multiple datasets of different sizes/shapes in a single model
- Replace the likelihood with arbitrary Stan code

Together this is enough to use `brms`-style predictors in connection with basically any type of model. For example, these tricks power my implementation of hidden Markov models with `brms`, discussed at https://discourse.mc-stan.org/t/fitting-hmms-with-time-varying-transition-matrices-using-brms-a-prototype/19645/7 .

Recovering parameters from a single simulation and a nice posterior predictive check are good starting points, but far from a guarantee that we implemented the model correctly. To be sure, we’ll check with SBC — if you are not familiar, SBC is a method that can discover almost all implementation problems in your model by repeatedly fitting simulated data. We’ll use the `SBC` R package and won’t explain all the details here — check the Getting started and SBC for `brms` vignettes for an explanation of the main concepts and API.

```
# Setting up SBC and paralellism
library(SBC)
future::plan(future::multisession)
gamma_shape <- 14
gamma_rate <- 4
trt_intercept_prior_mu <- 0.5
trt_intercept_prior_sigma <- 0.5
```

To make the model work with SBC, we add explicit priors for all parameters (as the simulations need to match those priors). We’ll use \(N(0,1)\) for most parameters, except the intercept for the random effect deviations (\(\alpha^\prime\)), where we’ll use \(N(0.5,0.5)\) to avoid both very low and very large standard deviations, which pose convergence problems. Similarly, a very low observation sigma causes convergence problems, so we’ll use a \(\Gamma(14, 4)\) prior (roughly saying that a priori the standard deviation is unlikely to be less than 1.9 or more than 5.6). I did not investigate deeply enough to understand the convergence issues, so I am not completely sure about the mechanism.
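Where the quoted range comes from — a quick sketch checking the central 95% interval of the \(\Gamma(14, 4)\) prior:

```
# Central 95% interval of the Gamma(14, 4) prior for the observation sd
round(qgamma(c(0.025, 0.975), shape = gamma_shape, rate = gamma_rate), 1)
# roughly 1.9 and 5.6
```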

`get_prior(f, combined_data)`

```
## prior class coef group resp dpar nlpar lb ub
## (flat) b
## (flat) Intercept
## (flat) b trty
## (flat) b trt_x trty
## student_t(3, 0, 2.5) Intercept trty
## (flat) b y
## (flat) b x y
## student_t(3, 0.9, 2.5) Intercept y
## student_t(3, 0, 2.5) sigma y 0 <NA>
## source
## default
## default
## default
## (vectorized)
## default
## default
## (vectorized)
## default
## default
```

```
priors <- c(
  set_prior("normal(0,1)", class = "b", resp = "trty"),
  set_prior(paste0("normal(", trt_intercept_prior_mu, ", ",
                   trt_intercept_prior_sigma, ")"),
            class = "Intercept", resp = "trty"),
  set_prior("normal(0,1)", class = "b", resp = "y"),
  set_prior("normal(0,1)", class = "Intercept", resp = "y"),
  set_prior(paste0("gamma(", gamma_shape, ", ", gamma_rate, ")"),
            class = "sigma", resp = "y")
)

# Function to generate a single simulated dataset
# Note: we reuse N_trts, N, N_pts, patient_id and patient_treatment
# from the previous code to keep the data passed via stanvars fixed.
generator_func <- function() {
  trt_intercept <- rnorm(1, mean = trt_intercept_prior_mu,
                         sd = trt_intercept_prior_sigma)
  trt_x_b <- rnorm(1)
  trt_data <- data.frame(trt_x = rnorm(N_trts))
  # Centering predictors to match brms
  trt_data$trt_x <- trt_data$trt_x - mean(trt_data$trt_x)

  trt_sd <- exp(trt_intercept + trt_x_b * trt_data$trt_x)

  ranef_raw <- rnorm(N_pts)
  ranef <- ranef_raw * trt_sd[patient_treatment]

  intercept <- rnorm(1)
  x_b <- rnorm(1)
  obs_sigma <- rgamma(1, gamma_shape, gamma_rate)

  obs_data <- data.frame(x = rnorm(N),
                         patient_id = base_data$patient_id)
  obs_data$x <- obs_data$x - mean(obs_data$x)
  obs_data$trt_id <- patient_treatment[obs_data$patient_id]
  obs_data_predictor <- intercept + x_b * obs_data$x + ranef[obs_data$patient_id]
  obs_data$y <- rnorm(N, mean = obs_data_predictor, sd = obs_sigma)

  combined_data <- rbind(
    obs_data %>% mutate(
      is_trt = FALSE,
      trt_x = 0,
      trt_y = 0
    ),
    trt_data %>% mutate(
      is_trt = TRUE,
      trt_id = 0,
      patient_id = 0,
      y = 0,
      x = 0,
      trt_y = 0
    )
  )

  list(generated = combined_data,
       variables = list(
         b_y_Intercept = intercept,
         b_y_x = x_b,
         b_trty_Intercept = trt_intercept,
         b_trty_trt_x = trt_x_b,
         sigma_y = obs_sigma,
         my_ranef_raw = ranef_raw
       ))
}

# Generate a lot of datasets
set.seed(33214855)
N_sims <- 1000
ds <- generate_datasets(SBC_generator_function(generator_func), n_sims = N_sims)

# With 1000 datasets, this takes ~45 minutes on my computer
backend <-
  SBC_backend_brms(f,
                   stanvars = predict_ranef_stanvars,
                   prior = priors,
                   template_data = combined_data,
                   chains = 2,
                   out_stan_file = file.path(cache_dir, "backend.stan"))
```

To increase the power of SBC to detect problems, we will also add the log-likelihood and log-prior as derived quantities (see Modrák et al. 2023 or the limits of SBC vignette for background on this).

```
compute_loglik <- function(y, is_trt, x, trt_x, patient_id,
                           trt_id, intercept, x_b, trt_intercept, trt_x_b,
                           ranef_raw, sigma_y) {
  patient_id <- patient_id[!is_trt]
  trt_id <- trt_id[!is_trt]
  x <- x[!is_trt]
  y <- y[!is_trt]
  trt_x <- trt_x[is_trt]

  patient_trt_all <- matrix(nrow = length(patient_id), ncol = 2)
  patient_trt_all[, 1] <- patient_id
  patient_trt_all[, 2] <- trt_id
  patient_trt_all <- unique(patient_trt_all)
  patient_treatment <- integer(max(patient_id))
  patient_treatment[patient_trt_all[, 1]] <- patient_trt_all[, 2]

  ranef_sigma <- exp(trt_intercept + trt_x * trt_x_b)
  ranef_vals <- ranef_raw * ranef_sigma[patient_treatment]
  mu <- intercept + x * x_b + ranef_vals[patient_id]
  sum(dnorm(y, mean = mu, sd = sigma_y, log = TRUE))
}

dq <- derived_quantities(
  lprior_fixed = dnorm(b_y_Intercept, log = TRUE) +
    dnorm(b_y_x, log = TRUE) +
    dnorm(b_trty_Intercept, mean = trt_intercept_prior_mu,
          sd = trt_intercept_prior_sigma, log = TRUE) +
    dnorm(b_trty_trt_x, log = TRUE) +
    dgamma(sigma_y, gamma_shape, gamma_rate, log = TRUE),
  loglik = compute_loglik(y = y, is_trt = is_trt, x = x, trt_x = trt_x,
                          patient_id = patient_id, trt_id = trt_id,
                          intercept = b_y_Intercept, x_b = b_y_x,
                          trt_intercept = b_trty_Intercept,
                          trt_x_b = b_trty_trt_x, ranef_raw = my_ranef_raw,
                          sigma_y = sigma_y),
  .globals = c("compute_loglik", "gamma_shape", "gamma_rate",
               "trt_intercept_prior_mu", "trt_intercept_prior_sigma")
)
```

We are now ready to actually run SBC:

```
sbc_res <-
  compute_SBC(
    ds,
    backend,
    dquants = dq,
    cache_mode = "results",
    cache_location = file.path(cache_dir, paste0("sbc", N_sims, ".rds")),
    keep_fits = N_sims <= 50
  )
```

`## Results loaded from cache file 'sbc1000.rds'`

`## - 305 (30%) fits had at least one Rhat > 1.01. Largest Rhat was 1.39.`

```
## - 2 (0%) fits had tail ESS undefined or less than half of the maximum rank, potentially skewing
## the rank statistics. The lowest tail ESS was 13.
## If the fits look good otherwise, increasing `thin_ranks` (via recompute_SBC_statistics)
## or number of posterior draws (by refitting) might help.
```

`## - 2 (0%) fits had divergent transitions. Maximum number of divergences was 2.`

`## - 11 (1%) fits had iterations that saturated max treedepth. Maximum number of max treedepth was 2000.`

`## - 999 (100%) fits had some steps rejected. Maximum number of rejections was 29.`

```
## Not all diagnostics are OK.
## You can learn more by inspecting $default_diagnostics, $backend_diagnostics
## and/or investigating $outputs/$messages/$warnings for detailed output from the backend.
```

There are still some convergence problems for some fits — the most worrying are the high Rhats, affecting almost a third of the fits. This should definitely warrant some further investigation, but the Rhats are not very large and this is a blog post, not a research paper, so we will not go down this rabbit hole.

A very small number of fits had divergences/treedepth issues, but given how few they are, those are not so worrying. The rejected steps are completely benign, as this count includes rejections during warmup.

Overall, it is in fact safe to just ignore the problematic fits as long as you would not use results from such fits in actual practice (which you shouldn’t) — see the rejection sampling vignette for more details.

We plot the results of the ECDF diff check — looking good!

```
vars <- sbc_res$stats %>% filter(!grepl("my_", variable)) %>%
  pull(variable) %>% unique() %>% c("my_ranef_raw[1]", "my_ranef_raw[2]")

excluded_fits <- sbc_res$backend_diagnostics$n_divergent > 0 |
  sbc_res$backend_diagnostics$n_max_treedepth > 0 |
  sbc_res$default_diagnostics$min_ess_tail < 200 |
  sbc_res$default_diagnostics$max_rhat > 1.01

sbc_res_filtered <- sbc_res[!excluded_fits]
plot_ecdf_diff(sbc_res_filtered, variables = vars)
```

We can also see how close our estimates are to the true values — once again this looks quite good: we learn quite a lot about all parameters except the random effects!

`plot_sim_estimated(sbc_res_filtered, variables = vars, alpha = 0.1)`

And that’s all. If you encounter problems running the models that you can’t resolve yourself, be sure to ask questions on Stan Discourse and tag me (@martinmodrak) in the question!

This post was built from Git revision `c9ed4bf49cb0082db9ad33f42c00c964d158c3f9`; you can download the `renv.lock` file required to reconstruct the environment.

`sessionInfo()`

```
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=Czech_Czechia.utf8 LC_CTYPE=Czech_Czechia.utf8
## [3] LC_MONETARY=Czech_Czechia.utf8 LC_NUMERIC=C
## [5] LC_TIME=Czech_Czechia.utf8
##
## time zone: Europe/Prague
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices datasets utils methods base
##
## other attached packages:
## [1] SBC_0.2.0.9000 bayesplot_1.11.1 knitr_1.45 lubridate_1.9.3
## [5] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
## [9] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.4.4
## [13] tidyverse_2.0.0 brms_2.20.4 Rcpp_1.0.12 cmdstanr_0.7.1
##
## loaded via a namespace (and not attached):
## [1] gridExtra_2.3 inline_0.3.19 rlang_1.1.3
## [4] magrittr_2.0.3 matrixStats_1.2.0 compiler_4.3.2
## [7] loo_2.6.0 vctrs_0.6.5 reshape2_1.4.4
## [10] pkgconfig_2.0.3 fastmap_1.1.1 backports_1.4.1
## [13] ellipsis_0.3.2 labeling_0.4.3 utf8_1.2.4
## [16] threejs_0.3.3 promises_1.2.1 rmarkdown_2.25
## [19] tzdb_0.4.0 markdown_1.12 ps_1.7.6
## [22] xfun_0.42 cachem_1.0.8 jsonlite_1.8.8
## [25] highr_0.10 later_1.3.2 parallel_4.3.2
## [28] R6_2.5.1 dygraphs_1.1.1.6 bslib_0.6.1
## [31] stringi_1.8.3 StanHeaders_2.32.5 parallelly_1.37.0
## [34] jquerylib_0.1.4 bookdown_0.37 rstan_2.32.5
## [37] zoo_1.8-12 base64enc_0.1-3 timechange_0.3.0
## [40] httpuv_1.6.14 Matrix_1.6-1.1 igraph_2.0.1.1
## [43] tidyselect_1.2.0 rstudioapi_0.15.0 abind_1.4-5
## [46] yaml_2.3.8 codetools_0.2-19 miniUI_0.1.1.1
## [49] blogdown_1.19 processx_3.8.3 listenv_0.9.1
## [52] pkgbuild_1.4.3 lattice_0.21-9 plyr_1.8.9
## [55] shiny_1.8.0 withr_3.0.0 bridgesampling_1.1-2
## [58] posterior_1.5.0 coda_0.19-4.1 evaluate_0.23
## [61] future_1.33.1 RcppParallel_5.1.7 xts_0.13.2
## [64] pillar_1.9.0 tensorA_0.36.2.1 checkmate_2.3.1
## [67] renv_1.0.2 DT_0.31 stats4_4.3.2
## [70] shinyjs_2.1.0 distributional_0.4.0 generics_0.1.3
## [73] hms_1.1.3 rstantools_2.4.0 munsell_0.5.0
## [76] scales_1.3.0 globals_0.16.2 gtools_3.9.5
## [79] xtable_1.8-4 glue_1.7.0 tools_4.3.2
## [82] shinystan_2.6.0 colourpicker_1.3.0 mvtnorm_1.2-4
## [85] cowplot_1.1.3 grid_4.3.2 QuickJSR_1.1.3
## [88] crosstalk_1.2.1 colorspace_2.1-0 nlme_3.1-163
## [91] cli_3.6.2 fansi_1.0.6 Brobdingnag_1.2-9
## [94] gtable_0.3.4 sass_0.4.8 digest_0.6.34
## [97] farver_2.1.1 htmlwidgets_1.6.4 memoise_2.0.1
## [100] htmltools_0.5.7 lifecycle_1.0.4 mime_0.12
## [103] shinythemes_1.2.0
```

But in short, the basic idea is that you implement your model twice: beyond a probabilistic program (e.g. in #Stan, #jags, …) + a sampling algorithm, you also need a simulator drawing from the prior distribution - this tends to be easy to implement. You then simulate multiple datasets, fit those with your probabilistic program and compute ranks of the prior parameter values within the posterior. If you did everything correctly, the ranks are uniform. Non-uniformity then signals a problem.
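To make the rank computation concrete, here is a minimal self-contained sketch in R for a toy conjugate model (normal mean with known unit variance); all names and settings are my own illustration, not from the SBC package:

```r
set.seed(1)
n_sims <- 200   # number of simulated datasets
n_draws <- 100  # "posterior draws" per fit
K <- 5          # data points per dataset

ranks <- replicate(n_sims, {
  alpha <- rnorm(1)            # draw the parameter from its prior N(0, 1)
  y <- rnorm(K, mean = alpha)  # simulate data given alpha
  # Conjugate posterior: N(K * mean(y) / (K + 1), 1 / (K + 1))
  draws <- rnorm(n_draws,
                 mean = K * mean(y) / (K + 1),
                 sd = sqrt(1 / (K + 1)))
  sum(draws < alpha)           # rank of the prior draw among posterior draws
})

# If the simulator and the "sampler" agree, ranks are uniform on 0..n_draws
hist(ranks, breaks = 10)
```

Replacing the conjugate posterior with a deliberately wrong one (e.g. forgetting the prior term) makes the histogram visibly non-uniform, which is exactly the signal SBC looks for.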

The gist of our new paper is that the strength of SBC depends on the choice of “test quantities” for which you compute ranks. The default approach is to take all individual parameter values. This is already a very useful check, but it leaves a big class of bugs you can’t detect (e.g. when the posterior is just the prior distribution). However, when you add derived test quantities combining the parameters with the simulated data, you can (in theory) detect any problem! (Yaaaay!) But in theory you may need infinitely many quantities :-(.

In practice, it seems to be quite sufficient to add just a few test quantities beyond the default. In particular, our experience as well as theoretical considerations indicate that the model likelihood is a very sensitive test quantity. The power of SBC is still limited by the number of simulated datasets you can reasonably fit, which primarily determines how large a discrepancy can go undetected.

More generally, we provide a firmly grounded theoretical analysis of SBC which will hopefully help others to build better intuition on how and why it works in practice. Notably, we make it clear that SBC does not check whether “data-averaged posterior equals the prior” as is sometimes claimed in the literature (and as I also was convinced when starting this project :-D )

The SBC package supports all of the ideas discussed in the paper in R. I personally now use SBC for almost all my Stan work from the get-go - although there is some extra work to set up SBC, you end up detecting bugs early and thus save time in the long run. If you want to see an example of simulation-driven model development workflow, check out the Small model implementation workflow vignette.

This work was made possible by generous help from Angie H. Moon, Shinyoung Kim, Paul Bürkner, Niko Huurre, Kateřina Faltejsková, Andrew Gelman and Aki Vehtari.

Nathaniel Haines made a neat tweet showing off his model of reaction times that handles possible contamination with both implausibly short reaction times (e.g., if people make an anticipatory response that is not actually based on processing the stimulus of interest) or implausibly large reaction times (e.g., if their attention drifts away from the task, but they snap back to it after having “zoned out” for a few seconds). Response times that arise from such processes are not actually what we aim to measure in most cognitive tasks — we are instead interested in how people process and respond to a particular stimulus. Therefore, by explicitly modeling these “contamination” response times, we can get better estimates of the decision-making parameters that we actually care about. Such a model often makes more sense than just throwing away a part of the data.

Several people asked if you can do that in `brms`. This started a vortex of productive procrastination on my side - it sure should be easy to do this, right? And while Nathaniel didn’t have `brms` code ready, I assure you that, yes, it is possible in `brms`. It is not completely straightforward, but I’ll show you the ~~path~~ code.

Nathaniel was kind enough to provide a bit of feedback on the post (I have no experience with reaction-time data or cogsci in general), but I should repeat that the clarity of the idea is his while all errors are mine. The overall idea of using mixtures of “real” and “contaminating” distributions to model responses is however not new - see e.g. Ratcliff & Tuerlinckx 2002.

In this model we will take a shifted lognormal representing the actual decision process and a uniform distribution modelling the contamination. For this to make sense, we need to have some upper limit on the possible contamination times, representing the maximum times we could have observed. In most cases, the limit should be larger than the maximum time we observed, although this is not strictly mathematically necessary. We then assume that each trial has a small probability of being contaminated.

Here is how generating a single data point for such a model could look in R code:

```
shift <- 0.1   # Shortest reaction time possible if not contaminated
mu <- log(0.5)
sigma <- 0.6
mix <- 0.06    # Probability of contamination
upper <- 5     # Maximum time of contamination

if(runif(1) < mix) {
  # Contaminated
  y <- runif(1, 0, upper)
} else {
  # Non-contaminated
  y <- shift + rlnorm(1, mu, sigma)
}
```

The same could be expressed in math as:

\[ y_i = \begin{cases} u_i & \mathrm{if} \quad z_i = 0 \\ s_i + r_i & \mathrm{if} \quad z_i = 1 \end{cases} \\ u_i \sim Uniform(0, \alpha) \\ \log(r_i) \sim Normal(\mu_i, \sigma) \\ P(z_i = 0) = \theta \]

Where \(\theta\) corresponds to `mix`, \(\alpha\) to `upper` and \(s_i\) to `shift`.

Technically, the non-contaminated signal is allowed to take values larger than `upper`. In practice we would however usually want `upper` to be large enough that larger values do not really occur.
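As a sanity check on this mixture formulation, we can write the density down directly and verify that it integrates to 1 (`dRTmixture` is a hypothetical helper for this check, not part of the model code):

```r
dRTmixture <- function(y, meanlog, sdlog, mix, shift, upper) {
  # dlnorm() returns 0 for y - shift <= 0, so the shift indicator is implicit
  mix * dunif(y, min = 0, max = upper) +
    (1 - mix) * dlnorm(y - shift, meanlog = meanlog, sdlog = sdlog)
}

# Total probability mass with the example parameters from above
integrate(function(y) dRTmixture(y, meanlog = log(0.5), sdlog = 0.6,
                                 mix = 0.06, shift = 0.1, upper = 5),
          lower = 0, upper = Inf)
# should be close to 1
```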

There is one important detail in how `brms` handles the shifted lognormal: `brms` treats `shift` as unknown and estimates it, but does not allow the `shift` parameter to be larger than any observed `y`. We will mimic this behaviour, but since we also have the contamination process, `shift` can in principle be larger than some `y`. This can potentially introduce problems for the sampler, as the posterior density is not smooth when `shift` crosses some of the observed `y` values (the lognormal component is added/removed, resulting in a sharp change).

It however turns out that if `shift` crossing some `y` is rare enough, the sampling works just fine. To ensure this rarity we introduce `max_shift` as the upper bound for `shift`. In most cases, this will be the same for the whole dataset. Instead of `shift`, the model then works with `shiftprop = shift / max_shift` - a value between 0 and 1 that is easier to work with mathematically.

Of the model parameters, we take `max_shift` and `upper` as known (though they may differ between observations), while `mu`, `sigma`, `mix` and `shiftprop` are to be estimated and can depend on predictors. However, `shiftprop` is a bit complicated here and the model will make most sense if observations that have different `max_shift` are also allowed to have different `shiftprop`, by putting a suitable predictor on `shiftprop`. Different `shiftprop` with the same `max_shift` is however definitely not an issue. So while you need to be careful with varying `max_shift`, varying `shiftprop` is OK - just note the implied logit scale. For a review of why varying `shift` might be important, see e.g. Dully, McGovern & O’Connell 2018.

For some use cases, one could also want to set the lower bound of the contamination distribution. To keep things simple we don’t do that here, but basically the same result can then be achieved by adding/subtracting a suitable number to the response (`y`) and bounds (`max_shift`, `upper`).

Some experimental designs also involve a limit on the maximum time the response could have taken. In such contexts, it might make sense to treat the values as right-censored. `brms` supports censoring for most families, so we want our implementation to be compatible with it.

Our goal is that at the end of the post we will be able to write models like

```
brm(bf(y | vreal(max_shift, upper) + cens(censoring) ~ 1 + condition + (1 | subject_id),
       sigma ~ condition,
       mix ~ (1 | subject_id),
       family = RTmixture), ...)
```

And let `brms` handle all the rest. The final result, packaged in a single file you can just load into your project, is at https://github.com/martinmodrak/blog/blob/master/content/post/RTmixture.R. However, be advised that the code was only quite shallowly tested, so look both ways before crossing and test whether you can recover parameters from simulated data before trusting me too much.

You may know that `brms` has good support for mixtures, so why not just write `family = mixture(uniform, shifted_lognormal)`? It turns out one of the core assumptions of `brms` is that every family has at least one parameter to be estimated - our uniform distribution for the contamination component however has none and thus cannot be used with `brms` directly. So instead we’ll have to implement a full-blown custom family.

The necessary background for implementing custom families in `brms` can be found in the vignette on custom distributions. Here, we will explain only the weirder stuff.

Let’s set up and get our hands dirty.

```
library(cmdstanr)
library(brms)
library(tidyverse)
library(knitr)
library(patchwork)
library(bayesplot)
source("RTmixture.R")

ggplot2::theme_set(cowplot::theme_cowplot())
options(mc.cores = parallel::detectCores(), brms.backend = "cmdstanr")

cache_dir <- "_RTmixture_cache"
if(!dir.exists(cache_dir)) {
  dir.create(cache_dir)
}
```

First, we’ll generate some fake data to test the model against. Below is just a more concise and optimized version of the random generation scheme I showed earlier.

```
rRTmixture <- function(n, meanlog, sdlog, mix, shift, upper) {
  ifelse(runif(n) < mix,
         runif(n, 0, upper),
         shift + rlnorm(n, meanlog = meanlog, sdlog = sdlog))
}
```

Then let us generate some data:

```
set.seed(31546522)
# Bounds of the data
max_shift <- 0.3
shift <- runif(1) * max_shift
upper <- 10
mix <- 0.1
N <- 100
Intercept <- 0.3
beta <- 0.5
X <- rnorm(N)
mu <- rep(Intercept, N) + beta * X
sigma <- 0.5
rt <- rRTmixture(N, meanlog = mu, sdlog = sigma, mix = mix, shift = shift, upper = upper)
dd <- data.frame(rt = rt, x = X, max_shift = max_shift, upper = upper)
```

Looking nice!

`ggplot(dd, aes(x = rt)) + geom_density()`

Now we need the Stan implementation of the family. That is probably the most technical part. The Stan user’s guide has some background on mixture models in Stan. We note that times before `shift` can only come from the uniform component and times after `upper` can only come from the lognormal component. For the remaining times we mix the lognormal and the uniform via `log_mix`.
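Before writing the Stan code, it can help to prototype the same piecewise logic in R and check it against the naive mixture density (`RTmixture_lpdf_r` is a throwaway prototype for this check, not the family we build below):

```r
RTmixture_lpdf_r <- function(y, mu, sigma, mix, shift, upper) {
  if (y <= shift) {
    # Only the uniform component can produce y here
    log(mix) + dunif(y, 0, upper, log = TRUE)
  } else if (y >= upper) {
    # Only the lognormal component can produce y here
    log1p(-mix) + dlnorm(y - shift, mu, sigma, log = TRUE)
  } else {
    # Both components contribute - mirrors Stan's log_mix()
    log(mix * dunif(y, 0, upper) +
          (1 - mix) * dlnorm(y - shift, mu, sigma))
  }
}

# Naive density on the log scale, for comparison
naive <- function(y) log(0.06 * dunif(y, 0, 5) +
                         0.94 * dlnorm(y - 0.1, log(0.5), 0.6))
RTmixture_lpdf_r(0.8, log(0.5), 0.6, 0.06, 0.1, 5) - naive(0.8)  # ~0
```

The piecewise branches only matter for numerical stability and for y outside the overlap region; mathematically the two functions agree everywhere.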

With the Stan code ready, we then define the parameters of the distribution in a way that brms understands.

```
stan_funs_base <- stanvar(block = "functions", scode = "
  real RTmixture_lpdf(real y, real mu, real sigma, real mix,
                      real shiftprop, real max_shift, real upper) {
    real shift = shiftprop * max_shift;
    if(y <= shift) {
      // Could only be created by the contamination
      return log(mix) + uniform_lpdf(y | 0, upper);
    } else if(y >= upper) {
      // Could only come from the lognormal
      return log1m(mix) + lognormal_lpdf(y - shift | mu, sigma);
    } else {
      // Actually mixing
      real lognormal_llh = lognormal_lpdf(y - shift | mu, sigma);
      real uniform_llh = uniform_lpdf(y | 0, upper);
      return log_mix(mix, uniform_llh, lognormal_llh);
    }
  }
")

RTmixture <- custom_family(
  "RTmixture",
  dpars = c("mu", "sigma", "mix", "shiftprop"), # Those will be estimated
  links = c("identity", "log", "logit", "logit"),
  type = "real",
  lb = c(NA, 0, 0, 0), # Bounds for the parameters
  ub = c(NA, NA, 1, 1),
  vars = c("vreal1[n]", "vreal2[n]") # Data for max_shift and upper (known)
)
```

And we are ready to fit! We will put a weakly informative `beta(1,5)` prior on the proportion of contamination - this means we a priori believe there is a 95% chance that the contamination is lower than `qbeta(0.95, 1, 5) = 0.4507197`. One could definitely be justified in tightening this prior even further toward zero for many tasks. `vreal` is just the `brms` way of annotating arbitrary additional data for the distribution. We need to pass both the family and the associated `stanvars`.
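The quantile above is easy to reproduce, and the same one-liner lets you explore tighter hypothetical alternatives:

```r
qbeta(0.95, 1, 5)   # 95% prior quantile of the contamination proportion
qbeta(0.95, 1, 20)  # a tighter hypothetical alternative, concentrating mass near zero
```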

```
fit_mix <- brm(rt | vreal(max_shift, upper) ~ x, data = dd, family = RTmixture,
               stanvars = stan_funs_base,
               refresh = 0,
               file = paste0(cache_dir, "/mix"), file_refit = "on_change",
               prior = c(prior(beta(1, 5), class = "mix")))
fit_mix
```

```
## Family: RTmixture
## Links: mu = identity; sigma = identity; mix = identity; shiftprop = identity
## Formula: rt | vreal(max_shift, upper) ~ x
## Data: dd (Number of observations: 100)
## Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.33 0.07 0.20 0.48 1.00 2097 2242
## x 0.44 0.05 0.33 0.54 1.00 2368 2460
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.45 0.05 0.36 0.56 1.00 2379 1976
## mix 0.13 0.05 0.05 0.24 1.00 3159 2589
## shiftprop 0.82 0.18 0.32 1.00 1.00 1640 1623
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

We note that we have quite good recovery of the effect of `x` (simulated as 0.5) and of `sigma` (which was 0.5), but 100 observations are not enough to constrain the `mix` parameter really well (simulated as 0.1).

For comparison, we also fit the default shifted lognormal as implemented in `brms`.

```
fit_base <- brm(rt ~ x, data = dd, family = shifted_lognormal, refresh = 0,
                file = paste0(cache_dir, "/base"), file_refit = "on_change")
fit_base
```

```
## Family: shifted_lognormal
## Links: mu = identity; sigma = identity; ndt = identity
## Formula: rt ~ x
## Data: dd (Number of observations: 100)
## Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.34 0.09 0.16 0.53 1.00 1958 1949
## x 0.47 0.07 0.33 0.61 1.00 2711 2495
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.69 0.07 0.56 0.82 1.00 1883 1802
## ndt 0.34 0.07 0.17 0.44 1.00 1634 1420
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

We see that the inferences for `sigma` are a bit biased, but this is not necessarily due only to the mixture; another potential source of bias is the different handling of the shift.

To support censoring in `brms`, the family has to come with log CDF (cumulative distribution function) and log CCDF (complementary CDF) implementations in Stan, which we provide below. Those match the `_lpdf` pretty closely.
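A cheap way to gain confidence in CDF code like this is to prototype it in R first and check the basic identities: the CDF should be 0 below the support, reach 1 at the upper extreme, and increase monotonically (`pRTmixture` is again a hypothetical helper, not the Stan code itself):

```r
pRTmixture <- function(y, meanlog, sdlog, mix, shift, upper) {
  # punif() clamps to 0/1 outside [0, upper]; plnorm() returns 0 for y - shift <= 0
  mix * punif(y, 0, upper) +
    (1 - mix) * plnorm(y - shift, meanlog = meanlog, sdlog = sdlog)
}

pRTmixture(0, log(0.5), 0.6, 0.06, 0.1, 5)    # 0
pRTmixture(1e6, log(0.5), 0.6, 0.06, 0.1, 5)  # 1 (up to rounding)
```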

```
stan_funs <- stan_funs_base + stanvar(block = "functions", scode = "
  real RTmixture_lcdf(real y, real mu, real sigma, real mix,
                      real shiftprop, real max_shift, real upper) {
    real shift = shiftprop * max_shift;
    if(y <= shift) {
      return log(mix) + uniform_lcdf(y | 0, upper);
    } else if(y >= upper) {
      // The whole uniform part is below, so its term is log(1) = 0
      return log_mix(mix, 0, lognormal_lcdf(y - shift | mu, sigma));
    } else {
      real lognormal_llh = lognormal_lcdf(y - shift | mu, sigma);
      real uniform_llh = uniform_lcdf(y | 0, upper);
      return log_mix(mix, uniform_llh, lognormal_llh);
    }
  }

  real RTmixture_lccdf(real y, real mu, real sigma, real mix,
                       real shiftprop, real max_shift, real upper) {
    real shift = shiftprop * max_shift;
    if(y <= shift) {
      // The whole lognormal part is above, so its term is log(1) = 0
      return log_mix(mix, uniform_lccdf(y | 0, upper), 0);
    } else if(y >= upper) {
      return log1m(mix) + lognormal_lccdf(y - shift | mu, sigma);
    } else {
      real lognormal_llh = lognormal_lccdf(y - shift | mu, sigma);
      real uniform_llh = uniform_lccdf(y | 0, upper);
      return log_mix(mix, uniform_llh, lognormal_llh);
    }
  }
")
```
```

To test whether this works, we’ll do quite aggressive censoring and treat anything larger than 1.5 as censored. In most cases it makes sense to have `upper` be the same as the censoring bound, so we’ll do that.

```
set.seed(25462255)
shift <- 0.15
cens_bound <- upper <- 1.5
mix <- 0.08
N <- 110
Intercept <- 0.5
beta <- -0.3
X <- rnorm(N)
mu <- rep(Intercept, N) + beta * X
sigma <- 0.4

rt <- rRTmixture(N, meanlog = mu, sdlog = sigma,
                 mix = mix, shift = shift, upper = upper)
censored <- rt > cens_bound
rt[censored] <- cens_bound

dd_cens <- data.frame(rt = rt,
                      censored = if_else(censored, "right", "none"),
                      x = X, max_shift = shift, upper = upper)
```

Finally, this model starts to be problematic if we try to estimate `shift` (well, actually `shiftprop`) as well. An easy way to make `shift` always equal to `max_shift` is to set a constant prior on `shiftprop`, as we do below.

```
fit_mix_cens <- brm(rt | vreal(max_shift, upper) + cens(censored) ~ x,
                    data = dd_cens,
                    family = RTmixture,
                    stanvars = stan_funs,
                    refresh = 0,
                    file = paste0(cache_dir, "/mix_cens"),
                    file_refit = "on_change",
                    prior = c(prior(beta(1, 5), class = "mix"),
                              prior(constant(1), class = "shiftprop")))
fit_mix_cens
```

```
## Family: RTmixture
## Links: mu = identity; sigma = identity; mix = identity; shiftprop = identity
## Formula: rt | vreal(max_shift, upper) + cens(censored) ~ x
## Data: dd_cens (Number of observations: 110)
## Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.46 0.08 0.34 0.65 1.00 2022 1674
## x -0.23 0.07 -0.38 -0.12 1.00 2094 1606
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.36 0.07 0.26 0.53 1.00 2019 2565
## mix 0.15 0.05 0.07 0.25 1.00 2943 2595
## shiftprop 1.00 0.00 1.00 1.00 NA NA NA
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

It works and the inferences are reasonably close to what we simulated with. A more thorough evaluation would require simulation-based calibration, which would be nice, but would require a bit more energy than I have now. But it seems that at least the models are not completely wrong.

If you want to model varying `shift` but are having issues fitting, it might make sense to adjust `max_shift` on a per-group basis, so that `max_shift` is larger than only a small proportion of observations in each group. As noted above, if you set a different `max_shift` per, for example, `subject_id`, you should also have `shiftprop ~ subject_id` or the model might not make sense.

We successfully fitted a few models, but there are some tweaks we need to make to get full use of the family. We might for example want to make predictions - e.g. for posterior predictive checks - so we also need to implement prediction code. You’ll notice that we just extract the parameters from the prepared predictions and pass them to the generator function we defined earlier.

```
posterior_predict_RTmixture <- function(i, prep, ...) {
  if((!is.null(prep$data$lb) && prep$data$lb[i] > 0) ||
     (!is.null(prep$data$ub) && prep$data$ub[i] < Inf)) {
    stop("Predictions for truncated distributions not supported")
  }

  mu <- brms:::get_dpar(prep, "mu", i = i)
  sigma <- brms:::get_dpar(prep, "sigma", i = i)
  mix <- brms:::get_dpar(prep, "mix", i = i)
  shiftprop <- brms:::get_dpar(prep, "shiftprop", i = i)

  max_shift <- prep$data$vreal1[i]
  upper <- prep$data$vreal2[i]
  shift <- shiftprop * max_shift

  rRTmixture(prep$ndraws, meanlog = mu, sdlog = sigma,
             mix = mix, shift = shift, upper = upper)
}
```

Note that the `get_dpar` helper that simplifies some bookkeeping is currently internal in `brms`, but will be exposed in an upcoming release.

With that, we can do a posterior predictive check for both models. We use only a single core for predictions because, on Windows, multicore is slow and will not be able to access the custom prediction functions.

```
pp_mix <- pp_check(fit_mix, type = "dens_overlay", ndraws = 100, cores = 1) +
  ggtitle("Mixture")
pp_base <- pp_check(fit_base, type = "dens_overlay", ndraws = 100, cores = 1) +
  ggtitle("Shifted lognormal")
pp_mix / pp_base
```

For this dataset, the mixture is not doing that much to improve the bulk of the predictions, but it manages to avoid the very long tail the lognormal-only model needs to accommodate the larger values.

We might also look at checks of the censored model. `brms` does not directly support predicting censored variables (because the data passed to the model are not enough to completely determine all censoring), but we can easily do this manually:

```
set.seed(123566)
pred_cens <- posterior_predict(fit_mix_cens, cores = 1)
pred_cens_cens <- pred_cens
# Do the censoring
pred_cens_cens[pred_cens > cens_bound] <- cens_bound

samples_dens <- sample(1:(dim(pred_cens)[1]), size = 50)
ppc_cens1 <- ppc_dens_overlay(dd_cens$rt, pred_cens_cens[samples_dens, ]) +
  ggtitle("Censored dataset")
ppc_cens2 <- ppc_stat(1.0 * (dd_cens$censored == "right"),
                      1.0 * (pred_cens >= cens_bound),
                      binwidth = 0.02) +
  ggtitle("Proportion censored")
ppc_cens1 + ppc_cens2
```

The model seems to do OK.

Similarly, we might want to do model comparison or stacking with `loo`, so we also implement the `log_lik` function.

```
## Needed for numerical stability
## from http://tr.im/hH5A
logsumexp <- function(x) {
  y <- max(x)
  y + log(sum(exp(x - y)))
}

RTmixture_lpdf <- function(y, meanlog, sdlog, mix, shift, upper) {
  unif_llh <- dunif(y, min = 0, max = upper, log = TRUE)
  lognormal_llh <- dlnorm(y - shift, meanlog = meanlog, sdlog = sdlog, log = TRUE) -
    plnorm(upper - shift, meanlog = meanlog, sdlog = sdlog, log.p = TRUE)

  # Computing logsumexp(log(mix) + unif_llh, log1p(-mix) + lognormal_llh)
  # but vectorized
  llh_matrix <- array(NA_real_, dim = c(2, max(length(unif_llh), length(lognormal_llh))))
  llh_matrix[1, ] <- log(mix) + unif_llh
  llh_matrix[2, ] <- log1p(-mix) + lognormal_llh
  apply(llh_matrix, MARGIN = 2, FUN = logsumexp)
}

log_lik_RTmixture <- function(i, draws) {
  mu <- brms:::get_dpar(draws, "mu", i = i)
  sigma <- brms:::get_dpar(draws, "sigma", i = i)
  mix <- brms:::get_dpar(draws, "mix", i = i)
  shiftprop <- brms:::get_dpar(draws, "shiftprop", i = i)

  max_shift <- draws$data$vreal1[i]
  upper <- draws$data$vreal2[i]
  shift <- shiftprop * max_shift
  y <- draws$data$Y[i]

  RTmixture_lpdf(y, meanlog = mu, sdlog = sigma,
                 mix = mix, shift = shift, upper = upper)
}
```

And now, we can compare the models:

```
fit_mix <- add_criterion(fit_mix, "loo", cores = 1)
fit_base <- add_criterion(fit_base, "loo", cores = 1)
loo_compare(fit_mix, fit_base)
```

```
## elpd_diff se_diff
## fit_mix 0.0 0.0
## fit_base -7.5 5.3
```

No surprise here - we simulated the data with the mixture model and indeed, it is preferred over the alternative model. Also, the shifted-lognormal model has one very influential observation, which turns out to be the smallest observed reaction time.

`dd$rt[fit_base$criteria$loo$diagnostics$pareto_k > 0.7]`

`## [1] 0.4909057`

`min(dd$rt)`

`## [1] 0.4909057`

This once again shows that the lognormal model has problems accommodating both the high and low contamination (while it is plausible it could accommodate a small amount of just high or just low contamination quite well).

Since `brms` is great, we can now do all sorts of stuff like put predictors on the `mix` parameter - e.g. to get a per-subject estimate of the amount of contamination.

To do this, we’ll also put a weakly informative prior on the intercept for the mixture that assumes low contamination, and a prior on the group standard deviation that doesn’t expect huge variability in the amount of contamination (with wider priors the model starts to diverge, as we would need much more data to constrain it well).

```
set.seed(35486622)
dd_subj <- dd_cens
dd_subj$subject_id <- sample(1:12, size = nrow(dd_cens), replace = TRUE)
fit_mix_all <- brm(
bf(rt | vreal(max_shift, upper) + cens(censored) ~ x,
mix ~ 1 + (1 | subject_id),
family = RTmixture),
data = dd_subj,
stanvars = stan_funs,
refresh = 0,
file = paste0(cache_dir, "/mix_all"), file_refit = "on_change",
prior = c(prior(normal(-3, 1), class = "Intercept", dpar = "mix"),
prior(normal(0,0.5), class = "sd", dpar = "mix"),
prior(constant(1), class = "shiftprop")))
fit_mix_all
```

```
## Family: RTmixture
## Links: mu = identity; sigma = identity; mix = logit; shiftprop = identity
## Formula: rt | vreal(max_shift, upper) + cens(censored) ~ x
## mix ~ 1 + (1 | subject_id)
## Data: dd_subj (Number of observations: 110)
## Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~subject_id (Number of levels: 12)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(mix_Intercept) 0.38 0.29 0.02 1.07 1.00 2706 2137
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 0.45 0.07 0.33 0.62 1.00 3586 2305
## mix_Intercept -2.05 0.44 -2.96 -1.25 1.00 4207 2807
## x -0.22 0.06 -0.37 -0.11 1.00 3715 2666
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.37 0.07 0.26 0.54 1.00 3534 3025
## shiftprop 1.00 0.00 1.00 1.00 NA NA NA
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Checking that posterior predictions work:

```
set.seed(1233354)
pred_cens <- posterior_predict(fit_mix_all, cores = 1)
pred_cens_cens <- pred_cens
pred_cens_cens[pred_cens > cens_bound] <- cens_bound
samples_dens <- sample(1:(dim(pred_cens)[1]), size = 50)
ppc_dens_overlay(dd_cens$rt, pred_cens_cens[samples_dens,])
```

We can also do multivariate models where some of the predictors are correlated across answers:

```
set.seed(0245562)
# Build a dataset containing two separate predictions
dd_both <- dd
dd_both$rt2 <- dd_cens$rt[1:nrow(dd_both)]
dd_both$x2 <- dd_cens$x[1:nrow(dd_both)]
dd_both$censored2 <- dd_cens$censored[1:nrow(dd_both)]
dd_both$max_shift2 <- dd_cens$max_shift[1:nrow(dd_both)]
dd_both$upper2 <- dd_cens$upper[1:nrow(dd_both)]
dd_both$subject_id <- sample(1:12, size = nrow(dd_both), replace = TRUE)
fit_mix_multivar <- brm(
bf(rt | vreal(max_shift, upper) ~ x,
mix ~ 1 + (1 | p | subject_id),
family = RTmixture) +
bf(rt2 | vreal(max_shift2, upper2) + cens(censored2) ~ x2,
mix ~ 1 + (1 | p | subject_id),
family = RTmixture),
data = dd_both,
stanvars = stan_funs,
refresh = 0,
file = paste0(cache_dir, "/mix_multivar"), file_refit = "on_change",
prior = c(prior(normal(-3, 1), class = "Intercept", dpar = "mix", resp = "rt"),
prior(normal(0,0.5), class = "sd", dpar = "mix", resp = "rt"),
prior(constant(1), class = "shiftprop", resp = "rt"),
prior(normal(-3, 1), class = "Intercept", dpar = "mix", resp = "rt2"),
prior(normal(0,0.5), class = "sd", dpar = "mix", resp = "rt2"),
prior(constant(1), class = "shiftprop", resp = "rt2")
),
adapt_delta = 0.95
)
```

`## Setting 'rescor' to FALSE by default for this model`

`fit_mix_multivar`

```
## Family: MV(RTmixture, RTmixture)
## Links: mu = identity; sigma = identity; mix = logit; shiftprop = identity
## mu = identity; sigma = identity; mix = logit; shiftprop = identity
## Formula: rt | vreal(max_shift, upper) ~ x
## mix ~ 1 + (1 | p | subject_id)
## rt2 | vreal(max_shift2, upper2) + cens(censored2) ~ x2
## mix ~ 1 + (1 | p | subject_id)
## Data: dd_both (Number of observations: 100)
## Draws: 4 chains, each with iter = 1000; warmup = 0; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~subject_id (Number of levels: 12)
## Estimate Est.Error l-95% CI u-95% CI
## sd(mix_rt_Intercept) 0.33 0.25 0.01 0.91
## sd(mix_rt2_Intercept) 0.37 0.27 0.02 1.01
## cor(mix_rt_Intercept,mix_rt2_Intercept) 0.00 0.58 -0.95 0.95
## Rhat Bulk_ESS Tail_ESS
## sd(mix_rt_Intercept) 1.00 2713 1787
## sd(mix_rt2_Intercept) 1.00 2587 2046
## cor(mix_rt_Intercept,mix_rt2_Intercept) 1.00 4331 2836
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## rt_Intercept 0.30 0.06 0.18 0.43 1.00 4743 3085
## mix_rt_Intercept -2.28 0.54 -3.50 -1.35 1.00 4035 1843
## rt2_Intercept 0.43 0.08 0.31 0.60 1.00 3258 1826
## mix_rt2_Intercept -1.91 0.44 -2.80 -1.11 1.00 4456 3016
## rt_x 0.46 0.05 0.36 0.56 1.00 5440 2978
## rt2_x2 -0.25 0.07 -0.40 -0.13 1.00 3316 2106
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_rt 0.49 0.05 0.40 0.60 1.00 4907 3014
## sigma_rt2 0.36 0.07 0.25 0.53 1.00 3449 2885
## shiftprop_rt 1.00 0.00 1.00 1.00 NA NA NA
## shiftprop_rt2 1.00 0.00 1.00 1.00 NA NA NA
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

Testing that predictions work even for multivariate models. Note that we don’t bother with censoring for `rt2`, so the predictions look wrong.

`pp_check(fit_mix_multivar, resp = "rt", ndraws = 30, cores = 1)`

`pp_check(fit_mix_multivar, resp = "rt2", ndraws = 30, cores = 1)`

But here we’ll also have to face possibly the biggest problem with `brms`: it becomes *very* easy to specify a model that is too complex to be well informed by the data we have, or even a completely broken model that no amount of data will save. The data and a few settings for the “crazy” models shown above actually had to be tweaked for them to work well for this post. So enjoy with moderation :-).

Again, if you want the complete code, packaged in a single file you can just load into your project, go to https://github.com/martinmodrak/blog/blob/master/content/post/RTmixture.R

If you encounter problems running the models that you can’t resolve yourself, be sure to ask questions on Stan Discourse and tag me (@martinmodrak) in the question!

`sessionInfo()`

```
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] bayesplot_1.8.1 patchwork_1.1.1 knitr_1.33
## [4] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.7
## [7] purrr_0.3.4 readr_2.0.1 tidyr_1.1.3
## [10] tibble_3.1.3 ggplot2_3.3.5 tidyverse_1.3.1
## [13] brms_2.16.2 Rcpp_1.0.7 cmdstanr_0.4.0.9000
##
## loaded via a namespace (and not attached):
## [1] readxl_1.3.1 backports_1.2.1 plyr_1.8.6
## [4] igraph_1.2.6 splines_4.1.0 crosstalk_1.1.1
## [7] rstantools_2.1.1 inline_0.3.19 digest_0.6.27
## [10] htmltools_0.5.1.1 rsconnect_0.8.24 fansi_0.5.0
## [13] magrittr_2.0.1 checkmate_2.0.0 tzdb_0.1.2
## [16] modelr_0.1.8 RcppParallel_5.1.4 matrixStats_0.61.0
## [19] xts_0.12.1 prettyunits_1.1.1 colorspace_2.0-2
## [22] rvest_1.0.1 haven_2.4.3 xfun_0.25
## [25] callr_3.7.0 crayon_1.4.1 jsonlite_1.7.2
## [28] lme4_1.1-27.1 zoo_1.8-9 glue_1.4.2
## [31] gtable_0.3.0 V8_3.4.2 distributional_0.2.2
## [34] pkgbuild_1.2.0 rstan_2.26.3 abind_1.4-5
## [37] scales_1.1.1 mvtnorm_1.1-2 DBI_1.1.1
## [40] miniUI_0.1.1.1 xtable_1.8-4 stats4_4.1.0
## [43] StanHeaders_2.26.3 DT_0.18 htmlwidgets_1.5.3
## [46] httr_1.4.2 threejs_0.3.3 posterior_1.0.1
## [49] ellipsis_0.3.2 pkgconfig_2.0.3 loo_2.4.1
## [52] farver_2.1.0 sass_0.4.0 dbplyr_2.1.1
## [55] utf8_1.2.2 labeling_0.4.2 tidyselect_1.1.1
## [58] rlang_0.4.11 reshape2_1.4.4 later_1.2.0
## [61] munsell_0.5.0 cellranger_1.1.0 tools_4.1.0
## [64] cli_3.0.1 generics_0.1.0 broom_0.7.9
## [67] ggridges_0.5.3 evaluate_0.14 fastmap_1.1.0
## [70] yaml_2.2.1 processx_3.5.2 fs_1.5.0
## [73] nlme_3.1-152 mime_0.11 projpred_2.0.2
## [76] xml2_1.3.2 compiler_4.1.0 shinythemes_1.2.0
## [79] rstudioapi_0.13 curl_4.3.2 gamm4_0.2-6
## [82] reprex_2.0.1 bslib_0.2.5.1 stringi_1.7.3
## [85] highr_0.9 ps_1.6.0 blogdown_1.5
## [88] Brobdingnag_1.2-6 lattice_0.20-44 Matrix_1.3-3
## [91] nloptr_1.2.2.2 markdown_1.1 shinyjs_2.0.0
## [94] tensorA_0.36.2 vctrs_0.3.8 pillar_1.6.2
## [97] lifecycle_1.0.0 jquerylib_0.1.4 bridgesampling_1.1-2
## [100] cowplot_1.1.1 httpuv_1.6.1 R6_2.5.0
## [103] bookdown_0.22 promises_1.2.0.1 gridExtra_2.3
## [106] codetools_0.2-18 boot_1.3-28 colourpicker_1.1.0
## [109] MASS_7.3-54 gtools_3.9.2 assertthat_0.2.1
## [112] withr_2.4.2 shinystan_2.5.0 mgcv_1.8-35
## [115] parallel_4.1.0 hms_1.1.0 grid_4.1.0
## [118] coda_0.19-4 minqa_1.2.4 rmarkdown_2.10
## [121] shiny_1.6.0 lubridate_1.7.10 base64enc_0.1-3
## [124] dygraphs_1.1.1.6
```

I wrote this post to test my understanding of the material - the logic of Bayes factors implies that there are multiple ways to compute the same Bayes factor, each providing a somewhat different intuition on how to interpret them. So we’ll see how we can compute the Bayes factor using a black-box ready-made package, then get the same number analytically via the prior predictive density, and then get the same number once more by writing a “supermodel” that includes both of the individual models we are comparing.

We’ll do less math and theory here, anyone who prefers math first, examples later or wants a deeper dive into the theory should start by reading the paper.

We will fit a few models with CmdStanR, the newer R interface for Stan, and with brms.

```
library(cmdstanr)
library(brms)
library(tidyverse)
library(knitr)
ggplot2::theme_set(cowplot::theme_cowplot())
options(mc.cores = parallel::detectCores(), brms.backend = "cmdstanr")
document_output <- isTRUE(getOption('knitr.in.progress'))
if(document_output) {
table_format <- knitr::kable
} else {
table_format <- identity
}
```

**Note on notation:** I tried to be consistent and use plain symbols (\(y_1, z, ...\))
for variables, bold symbols (\(\mathbf{y}, \mathbf{\Sigma}\)) for vectors and matrices,
\(P(A)\) for the probability of event \(A\) and \(p(y)\) for the density of random variable.

We will keep stuff very simple. Our first contestant, the humble *null model* a.k.a. \(\mathcal{M}_1\)
will be that the \(K\) data points are independent draws from a standard normal distribution, i.e.:

\[ \mathcal{M}_1 : \mathbf{y} = \{y_1, ... , y_K\} \\ y_i \sim N(0,1) \] In code, simulating from this model would look like:

```
N <- 10 # Size of dataset
y <- rnorm(N, mean = 0, sd = 1)
```

The null model has faced a lot of rejection in its life, but has kept its spirits up despite all the adversity. But will it be enough?

The challenger will be the daring *intercept model* a.k.a. the destroyer of souls
a.k.a. \(\mathcal{M}_2\) that posits that there is an unknown,
almost surely non-zero mean of the normal distribution, i.e.:

\[ \mathcal{M}_2: \mathbf{y} = \{y_1, ... , y_K\} \\ y_i \sim N(\alpha, 1) \\ \alpha \sim N(0,2) \]

The corresponding R code would be:

```
N <- 10 # Size of dataset
alpha <- rnorm(1, 0, 2)
y <- rnorm(N, mean = alpha, sd = 1)
```

This comparison is basically the Bayesian alternative to a single-sample t-test with fixed variance - so very simple indeed.

Finally let’s prepare some data for our contestants to chew on:

`y <- c(0.5,0.7, -0.4, 0.1)`

We will start where most users probably start: invoking a statistical package to compute the Bayes factor. Here we will use the `hypothesis` function from `brms`, which uses the Savage-Dickey method under the hood.

For this, we note that the null model is just a special case of the intercept model, destroyer of suns, so let us fit this model (\(\mathcal{M}_2\)) in `brms`. We will be using a lot of samples to reduce estimator error, as Bayes factors can be quite sensitive.

```
cache_dir <- "_bf_cache"
if(!dir.exists(cache_dir)) {
dir.create(cache_dir)
}
fit_brms <- brm(y ~ 0 + Intercept, # `0 + Intercept` avoids centering
prior =
c(prior(normal(0,2), class = b), # Our prior on intercept
prior(constant(1), class = sigma)), # Fix sigma to a constant
data = data.frame(y = y),
iter = 10000,
sample_prior = "yes", # Needed to compute BF
refresh = 0, silent = TRUE,
file = paste0(cache_dir, "/fit"), # Cache the results
file_refit = "on_change")
```

`hypothesis` then gives us two numbers: the Bayes factor of null over intercept (\(BF_{12}\)), a.k.a. the evidence ratio, and the posterior probability that the null model generated the data, \(P(\mathcal{M}_1 | \mathbf{y})\):

```
hyp_res <- hypothesis(fit_brms, "Intercept = 0")
bf_brms <- hyp_res$hypothesis$Evid.Ratio
prob_null_brms <- hyp_res$hypothesis$Post.Prob
res_brms <- data.frame(method = "brms", bf = bf_brms, prob_null = prob_null_brms)
res_brms %>% table_format
```

| method | bf | prob_null |
|---|---|---|
| brms | 3.898489 | 0.7958554 |

Those two quantities happen to share a pretty straightforward relationship, given the prior probabilities of the individual models \(P(\mathcal{M}_1)\) and \(P(\mathcal{M}_2)\):

\[
P(\mathcal{M}_1 | \mathbf{y}) = \frac{BF_{12}P(\mathcal{M}_1)}{BF_{12}P(\mathcal{M}_1) + BF_{22}P(\mathcal{M}_2)} =
\frac{BF_{12}P(\mathcal{M}_1)}{BF_{12}P(\mathcal{M}_1) + P(\mathcal{M}_2)}
\]
Where \(BF_{22} = 1\) is the Bayes factor of the intercept model, cruel queen of the damned, against itself. If the prior is uniform, i.e. \(P(\mathcal{M}_1) = P(\mathcal{M}_2)\) - as is assumed by `brms` - the formula further simplifies to:

\[ P(\mathcal{M}_1 | \mathbf{y}) = \frac{BF_{12}}{BF_{12} + 1} \]
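As a quick numeric sanity check of this formula, here is a tiny Python helper (the function name is made up for illustration; the post's own code is R):

```python
def prob_null(bf12, prior_null=0.5):
    # posterior probability of M1 given the Bayes factor BF12
    # and the prior probability of M1 (uniform model prior by default)
    prior_alt = 1.0 - prior_null
    return bf12 * prior_null / (bf12 * prior_null + prior_alt)

print(round(prob_null(3.748338), 4))  # 0.7894
```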

The Bayes factor might look a bit like a magic number, so let us demystify it with two more transparent approaches that give us the same Bayes factor.

One way to define Bayes factor is as the ratio of *evidence* i.e.:

\[ BF_{12} = \frac{P(\mathbf{y} | \mathcal{M}_1)}{P(\mathbf{y} | \mathcal{M}_2)} \]

Where “evidence” is exactly the prior density of the data after integrating out all the parameters. Since we have chosen very simple models, there is an analytic formula for evidence for both models.

As is usually done for numerical stability, we will primarily operate on the log scale. Computing the evidence for the null model is straightforward - there are no parameters to integrate out, so we can directly compute:

```
prior_log_evidence_null <- sum(dnorm(y, mean = 0, sd = 1, log = TRUE))
prior_log_evidence_null
```

`## [1] -4.130754`

The evidence for the intercept model, prophet of the apocalypse, is *slightly* more involved.
The main idea is that we can see the data as a sum of two multivariate normals,
one with perfect correlation (corresponding to \(K\) identical copies of \(\alpha\)) and
one with no correlation (corresponding to noise):

\[ \mathcal{M}_2: \mathbf{y} = \mathbf{a} + \mathbf{e} \\ \mathbf{a} \sim MVN(0, \mathbf{\Sigma_a}), \forall_{i,j \in \{1, .., K\}} \Sigma_{a;i,j} = Var(\alpha) = 4 \\ \mathbf{e} \sim MVN(0, \mathbf{\Sigma_e}), \mathbf{\Sigma_{e}} = I \\ \]

where \(\mathbf{\Sigma_a}\) is a matrix with all elements set to \(Var(\alpha) = 4\) and \(\mathbf{\Sigma_e}\) is the identity matrix (1 on the diagonal, 0 elsewhere). From the properties of the multivariate normal we know that the distribution of a sum of independent multivariate normals is obtained by simply adding the means and covariance matrices together, i.e.:

\[ \mathcal{M}_2: \mathbf{y} \sim MVN(0, \mathbf{\Sigma}) \\ \forall_{i \in \{1, .., K\}}\mathbf{\Sigma}_{i,i} = 5 \\ \forall_{i \neq j \in \{1, .., K\}} \mathbf{\Sigma}_{i,j} = 4 \]

where \(\mathbf{\Sigma} = \mathbf{\Sigma_a} + \mathbf{\Sigma_e}\). With that we can compute the evidence for the intercept model, the harbinger of sorrow, \(P(\mathbf{y} | \mathcal{M}_2)\) as the corresponding multivariate normal density:

```
prior_Sigma <- diag(length(y)) + matrix(4, nrow = length(y), ncol = length(y))
prior_log_evidence_intercept <-
mvtnorm::dmvnorm(y, mean = rep(0, length(y)), sigma = prior_Sigma, log = TRUE)
prior_log_evidence_intercept
```

`## [1] -5.452067`
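Both evidence values (and the implied Bayes factor) can be cross-checked without any matrix library: since \(\mathbf{\Sigma} = I + 4 \cdot \mathbf{1}\mathbf{1}^T\), the Sherman-Morrison formula gives \(\mathbf{\Sigma}^{-1} = I - \frac{4}{1 + 4K}\mathbf{1}\mathbf{1}^T\) and \(\det \mathbf{\Sigma} = 1 + 4K\). A quick sketch of this check in Python (used purely for illustration):

```python
import math

y = [0.5, 0.7, -0.4, 0.1]
K = len(y)
var_alpha = 4.0  # prior variance of alpha in the intercept model

# Evidence of M1: product of independent standard normal densities
log_ev_null = sum(-0.5 * math.log(2 * math.pi) - 0.5 * v * v for v in y)

# Evidence of M2: MVN(0, Sigma) density with Sigma = I + var_alpha * J,
# using Sherman-Morrison for the quadratic form and det(Sigma) = 1 + var_alpha * K
s = sum(y)
ss = sum(v * v for v in y)
quad = ss - var_alpha / (1 + var_alpha * K) * s * s
log_ev_int = -0.5 * (K * math.log(2 * math.pi) + math.log(1 + var_alpha * K) + quad)

print(round(log_ev_null, 6), round(log_ev_int, 6))  # -4.130754 -5.452067
print(round(math.exp(log_ev_null - log_ev_int), 6))  # ≈ 3.748338
```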

The Bayes factor is then simply the ratio of the evidences, and we can use the BF to compute the posterior probability of the null model, assuming once again a uniform prior over models.

```
bf_prior <- exp(prior_log_evidence_null - prior_log_evidence_intercept)
prob_null_prior <- bf_prior / (bf_prior + 1)
res_prior <- data.frame(method = "prior", bf = bf_prior, prob_null = prob_null_prior)
rbind(res_prior, res_brms) %>% table_format
```

| method | bf | prob_null |
|---|---|---|
| prior | 3.748338 | 0.7894000 |
| brms | 3.898489 | 0.7958554 |

We see the results happen to match pretty closely with the value obtained via `brms`. The difference can be attributed to sampling error in the `brms` computation.

The perspective via the prior predictive distribution lays bare why the Bayes factor is so sensitive to the choice of priors - priors are quite literally the only thing that matters. We may also note that in computing the Bayes factor we decide not to use the data to inform the parameters of the model. This might be desirable for people who really fear “double dipping” - it makes some sense to test two competing hypotheses without letting the data inform either of the models. But it IMHO only makes sense when your hypotheses are precise enough (i.e. have narrow priors) that you don’t really need additional data to constrain them.

If your priors are wide and your hypotheses imprecise, Bayes factor may not give you anything useful (as is neatly shown in the Schad et al. paper, which you should read).

A third perspective is provided by imagining a “supermodel” combining our two contestants, i.e.:

\[ \mathcal{M}_{super}: \mathbf{y} = \{y_1, ... , y_K\} \\ z \in \{1,2\}; P(z = 1) = P(z = 2) = \frac{1}{2}\\ y_i \sim N(\mu, 1) \\ \mu = \begin{cases} 0 & \mathrm{if} \quad z = 1 \\ \alpha & \mathrm{if} \quad z = 2 \end{cases} \\ \alpha \sim N(0,2) \] So we have introduced a new random variable \(z\) that works as an index over the two models and gave it a discrete uniform prior.
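Simulating from the supermodel is straightforward - draw the index \(z\) first, then the data conditional on it. A small sketch in Python (the function name is made up for illustration; the post's own code is R):

```python
import random

random.seed(20210328)

def simulate_supermodel(K):
    # draw the model index z with P(z = 1) = P(z = 2) = 1/2,
    # then draw the data conditional on z
    z = 1 if random.random() < 0.5 else 2
    mu = 0.0 if z == 1 else random.gauss(0, 2)  # alpha ~ N(0, 2) under M2
    return z, [random.gauss(mu, 1) for _ in range(K)]

z, y_sim = simulate_supermodel(4)
print(z, [round(v, 2) for v in y_sim])
```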

If we fit this model to the data, then \(P(\mathcal{M}_1 | \mathbf{y}) = P(z = 1 | \mathbf{y})\), so we can use the fitted distribution of \(z\) to get the posterior probability of the models which (as shown above) we can transform into a Bayes factor.

Before showing how to do this in Stan, we can use JAGS to implement this model very directly, if not very efficiently:

`suppressPackageStartupMessages(library(rjags))`

`## Warning: package 'rjags' was built under R version 4.3.3`

```
data_list <- list(
y = y)
# The model specification, note that JAGS uses precision
# to parametrize normal distribution
# Additionally we have z_0 = z - 1
jags_model_string <- "model {
  z_0 ~ dbern(0.5)
  alpha ~ dnorm(0, 0.25)
  for(i in 1:length(y)) {
    mu[i] = ifelse(z_0 == 0, 0, alpha)
    y[i] ~ dnorm(mu[i], 1)
  }
}"
params_to_monitor <- c("z_0", "alpha")
# Running the model
model <- jags.model( textConnection(jags_model_string), data_list,
n.chains = 4, n.adapt= 1000)
```

```
## Compiling model graph
## Resolving undeclared variables
## Allocating nodes
## Graph information:
## Observed stochastic nodes: 4
## Unobserved stochastic nodes: 2
## Total graph size: 12
##
## Initializing model
```

```
update(model, 1000); # Burning 1000 samples to the MCMC gods...
mcmc_samples <- coda.samples(model, params_to_monitor, n.iter=10000, thin = 10)
# Extract values and check diagnostics
alpha_array <- cbind(mcmc_samples[[1]][, "alpha"], mcmc_samples[[2]][, "alpha"],
mcmc_samples[[3]][, "alpha"], mcmc_samples[[4]][, "alpha"])
posterior::rhat(alpha_array)
```

`## [1] 1.00007`

`posterior::ess_bulk(alpha_array)`

`## [1] 3802.998`

```
z_array <- cbind(mcmc_samples[[1]][, "z_0"], mcmc_samples[[2]][, "z_0"],
mcmc_samples[[3]][, "z_0"], mcmc_samples[[4]][, "z_0"])
posterior::rhat(z_array)
```

`## [1] 1.000246`

`posterior::ess_bulk(z_array)`

`## [1] 3962.832`

The posterior probability of the data being from the null model, \(P(\mathcal{M}_1 | \mathbf{y})\), is then simply \(P(z = 1 | \mathbf{y}) = P(z_0 = 0 | \mathbf{y})\), which can be directly extracted from the samples.

```
prob_null_jags <- mean(z_array == 0)
bf_jags <- prob_null_jags / (1 - prob_null_jags)
res_jags <- data.frame(method = "super_jags", bf = bf_jags,
prob_null = prob_null_jags)
rbind(res_jags, res_prior, res_brms) %>% table_format()
```

| method | bf | prob_null |
|---|---|---|
| super_jags | 3.711425 | 0.7877500 |
| prior | 3.748338 | 0.7894000 |
| brms | 3.898489 | 0.7958554 |

We once again see a good agreement.

But we obviously want to do this in Stan. Stan doesn’t handle discrete parameters, you say? This can be overcome!

*Warning: I had a lot of trouble thinking clearly about the rest of this section, so it is possible there are mistakes or that stuff is confusing. Please let me know if you find the reasoning problematic/unclear/unconvincing.*

The key idea is called marginalization and there is a neat intro to it in the preprint for the `rater` package. More discussion and examples are in the Stan User’s guide. The point is that we can express the unnormalized density \(p(\alpha, \mathbf{y})\) as:

\[ \mathcal{M}_{super-marginalized}: \\ p(\alpha, \mathbf{y}) = p(\alpha)p(\mathbf{y}| \alpha) \\ p(\mathbf{y}| \alpha) = p(\mathbf{y} | \alpha, z = 1)P(z = 1) + p(\mathbf{y} | \alpha, z = 2)P(z = 2) \] Now we have a density for \(\mathbf{y}\) that is a mixture of two normal densities and does not have \(z\) as a free parameter, and it can thus be written in Stan as:

`cat(readLines("2021-bayes-factors.stan"), sep = "\n")`

```
data {
  int K;
  vector[K] y;
}
parameters {
  real alpha;
}
transformed parameters {
  real log_lik_null = normal_lpdf(y | 0, 1);
  real log_lik_intercept = normal_lpdf(y | alpha, 1);
}
model {
  target += log_mix(0.5, log_lik_null, log_lik_intercept);
  target += normal_lpdf(alpha | 0, 2);
}
```
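The `log_mix(theta, lp1, lp2)` call in the model block is just a numerically stable shorthand for \(\log\left(\theta e^{lp_1} + (1 - \theta) e^{lp_2}\right)\); a Python re-implementation, for illustration only:

```python
import math

def log_mix(theta, lp1, lp2):
    # stable log(theta * exp(lp1) + (1 - theta) * exp(lp2))
    a = math.log(theta) + lp1
    b = math.log1p(-theta) + lp2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

print(round(log_mix(0.5, -4.130754, -5.452067), 4))  # ≈ -4.5874
```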

`super_model <- cmdstan_model("2021-bayes-factors.stan")`

But how do we get back to the value we are actually interested in, that is \(P(z = 1 | \mathbf{y})\), when there is no longer any \(z\) in the model?

We can use Bayes’ rule, and since \(z\) and \(\alpha\) are a priori independent, we have \(P(z = 1 | \alpha) = P(z = 1) = \frac{1}{2}\). This gives:

\[ P(z = 1 | \mathbf{y}, \alpha) = \frac{P(z = 1 | \alpha) \times p(\mathbf{y} | z = 1, \alpha)}{P(z = 1 | \alpha) \times p(\mathbf{y} | z = 1, \alpha) + P(z = 2 | \alpha) \times p(\mathbf{y} | z = 2, \alpha)} = \frac{\frac{1}{2} p(\mathbf{y} | z = 1)}{\frac{1}{2} p(\mathbf{y} | z = 1) + \frac{1}{2} p(\mathbf{y} | z = 2, \alpha)} \]

So to get the desired \(P(z = 1 | \mathbf{y})\) I need to integrate out \(\alpha\) with respect to its posterior, which I can approximate via posterior samples:

\[ P(z = 1 | \mathbf{y}) = \int P(z = 1 | \mathbf{y}, \alpha) \, p(\alpha | \mathbf{y}) \, \mathrm{d} \alpha \simeq \frac{1}{M} \sum_m P(z = 1 | \mathbf{y}, \alpha = \alpha_m) \]

So let us fit the model:

```
fit_super <- super_model$sample(
data = list(K = length(y), y = y), refresh = 0, iter_sampling = 10000)
```

```
## Running MCMC with 4 chains, at most 12 in parallel...
##
## Chain 1 finished in 0.3 seconds.
## Chain 2 finished in 0.3 seconds.
## Chain 3 finished in 0.2 seconds.
## Chain 4 finished in 0.3 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 0.3 seconds.
## Total execution time: 0.5 seconds.
```

`fit_super$summary() %>% table_format()`

| variable | mean | median | sd | mad | q5 | q95 | rhat | ess_bulk | ess_tail |
|---|---|---|---|---|---|---|---|---|---|
| lp__ | -6.5619847 | -6.4162500 | 0.847365 | 0.7940064 | -8.159686 | -5.703508 | 1.000154 | 11044.62 | 12052.58 |
| alpha | 0.0490804 | 0.0948292 | 1.793511 | 1.4701640 | -3.034274 | 3.098635 | 1.000182 | 11120.88 | 11475.93 |
| log_lik_null | -4.1307500 | -4.1307500 | 0.000000 | 0.0000000 | -4.130750 | -4.130750 | NA | NA | NA |
| log_lik_intercept | -10.5246013 | -6.0134100 | 10.546830 | 2.8471109 | -32.070390 | -4.041240 | 1.000103 | 10990.77 | 12359.95 |

No warnings, and `rhat` and `ess` look good. Now, let’s extract the partial likelihoods and combine them into the BMA weight:

```
super_samples <-posterior::as_draws_matrix(fit_super$draws())
rel_p_null <- exp(super_samples[,"log_lik_null"])
rel_p_intercept <- exp(super_samples[,"log_lik_intercept"])
prob_null_stan <- mean(rel_p_null / (rel_p_null + rel_p_intercept))
bf_stan <- prob_null_stan / (1 - prob_null_stan)
res_stan <- data.frame(
method = "super_stan", bf = bf_stan, prob_null = prob_null_stan)
rbind(res_stan, res_jags, res_prior, res_brms) %>% table_format()
```

| method | bf | prob_null |
|---|---|---|
| super_stan | 3.753493 | 0.7896284 |
| super_jags | 3.711425 | 0.7877500 |
| prior | 3.748338 | 0.7894000 |
| brms | 3.898489 | 0.7958554 |

By marginalizing, we not only get a model with fewer parameters (and hence one that is simpler to sample), we are also able to better resolve the tails: if I have 1000 samples of \(z\), I cannot reliably estimate whether \(P(z = 1) < 10^{-3}\) - I will just see that no (or almost no) samples equal \(1\). After marginalizing, each sample directly gives me a value of \(P(z = 1 | \mathbf{y}, \alpha)\), so I can get reasonable precision even if \(P(z = 1)\) is very close to \(0\) or \(1\).
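To see the tail-resolution point concretely, here is a toy comparison in Python (the constant per-draw probability is a stand-in for \(P(z = 1 | \mathbf{y}, \alpha_m)\), which would vary across draws in a real model):

```python
import random

random.seed(1)
p_true = 5e-4  # a null probability too small for 1000 binary draws to resolve
M = 1000

# Unmarginalized: sample z directly; the estimate has granularity 1/M,
# so it is usually exactly 0 here
z_draws = [random.random() < p_true for _ in range(M)]
print(sum(z_draws) / M)

# Marginalized: every draw contributes its conditional probability,
# so even tiny probabilities are estimated smoothly
per_draw_probs = [p_true] * M
print(round(sum(per_draw_probs) / M, 6))  # 0.0005
```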

Viewing Bayes factors as derived from posterior probabilities of an index over candidate models (\(z\) in our case), puts a different angle on the distinction between \(\mathcal{M}\)-closed scenarios (the true model is among the models we compare) and \(\mathcal{M}\)-open / \(\mathcal{M}\)-complete where this is not the case.

If we treat model selection as fitting the supermodel with the candidates as submodels, being in an \(\mathcal{M}\)-open scenario simply means that the supermodel is misspecified. Unfortunately, as discussed above, Bayes factors can be very sensitive to both prior and likelihood specification, where other approaches, e.g. approximate leave-one-out cross validation with `loo`, are usually less sensitive to priors and less often overconfident in the face of misspecification (see e.g. Using Stacking to Average Bayesian Predictive Distributions by Yao et al. for more discussion).

But we have many good ways to detect model misspecification, including prior and posterior predictive checks and other steps outlined in the Bayesian workflow preprint. So it would be interesting to test whether there is something to be gained by running prior and posterior predictive checks (or other diagnostics) not only for the component models, but also for the “supermodel” implied by the selection scheme. But I’ll leave that to people who actually want to use Bayes factors :-D.

I’ll further note that I tried to implement a supermodel with two realistic models and it appears that such supermodels can be challenging to fit (my impression is that the problem is primarily in initialization, but I didn’t investigate thoroughly).

Above, we computed the Bayes factors only for one dataset, so you might not be convinced that the agreement holds for any dataset. So here we’ll encapsulate the code in a few functions and evaluate a somewhat larger sample.

```
bf_brms <- function(y) {
fit_new <- update(fit_brms, newdata = data.frame(y = y), refresh = 0)
hyp_res <- hypothesis(fit_new, "Intercept = 0")
bf_brms <- hyp_res$hypothesis$Evid.Ratio
prob_null_brms <- hyp_res$hypothesis$Post.Prob
data.frame(method = "brms", bf = bf_brms, prob_null = prob_null_brms)
}
bf_prior <- function(y) {
prior_log_evidence_null <- sum(dnorm(y, mean = 0, sd = 1, log = TRUE))
prior_Sigma <- diag(length(y)) + matrix(4, nrow = length(y), ncol = length(y))
prior_log_evidence_intercept <-
mvtnorm::dmvnorm(y, mean = rep(0, length(y)), sigma = prior_Sigma, log = TRUE)
bf_prior <- exp(prior_log_evidence_null - prior_log_evidence_intercept)
prob_null_prior <- bf_prior / (bf_prior + 1)
data.frame(method = "prior", bf = bf_prior, prob_null = prob_null_prior)
}
bf_super_stan <- function(y) {
fit_super <- super_model$sample(data = list(K = length(y), y = y), refresh = 0,
iter_sampling = 10000)
super_samples <-posterior::as_draws_matrix(fit_super$draws())
rel_p_null <- exp(super_samples[,"log_lik_null"])
rel_p_intercept <- exp(super_samples[,"log_lik_intercept"])
p_null <- mean(rel_p_null / (rel_p_null + rel_p_intercept))
bf <- p_null / (1 - p_null) # The Bayes factor
data.frame(method = "super_Stan", bf = bf, prob_null = p_null)
}
bf_super_jags <- function(y) {
data_list <- list(
y = y)
params_to_monitor <- c("z_0", "alpha")
# Running the model
model <- jags.model( textConnection(jags_model_string), data_list,
n.chains = 4, n.adapt= 1000)
update(model, 1000); # Burning 1000 samples to the MCMC gods...
mcmc_samples <- coda.samples(model, params_to_monitor, n.iter=10000, thin = 10)
z_array <- cbind(mcmc_samples[[1]][, "z_0"], mcmc_samples[[2]][, "z_0"],
mcmc_samples[[3]][, "z_0"], mcmc_samples[[4]][, "z_0"])
prob_null_jags <- mean(z_array == 0)
bf_jags <- prob_null_jags / (1 - prob_null_jags)
data.frame(method = "super_jags", bf = bf_jags, prob_null = prob_null_jags)
}
```

```
comparisons_file <- paste0(cache_dir, "/comparison.rds")
if(!file.exists(comparisons_file)) {
set.seed(3384528)
res_list <- list()
for(i in 1:50) {
use_null <- (i %% 2) == 0
N <- 2 + rnbinom(1, mu = 10, size = 1)
if(use_null) {
mu <- 0
} else {
mu <- rnorm(1, 0, 2)
}
y <- rnorm(N, mu, sd = 1)
res_list[[i]] <- rbind(bf_prior(y),
bf_brms(y),
bf_super_jags(y),
bf_super_stan(y))
res_list[[i]]$id = i
res_list[[i]]$null_true = use_null
res_list[[i]]$N = N
}
res <- do.call(rbind, res_list)
saveRDS(res, comparisons_file)
} else {
res <- readRDS(comparisons_file)
}
```

We note that some Bayes factors - exclusively those computed by the unmarginalized JAGS model - are 0. This is unsurprising, as the 10000-iteration JAGS model loses the ability to capture very low null probabilities and hence low Bayes factors - and indeed, the analytically computed Bayes factors for those problematic results are invariably very low.

```
# Contrast the BFs with the "prior" (analytically computed)
comparisons_to_plot <- res %>% group_by(id) %>%
mutate(bf_prior = bf[method == "prior"],
prob_null_prior = prob_null[method == "prior"]) %>%
ungroup() %>%
filter(method != "prior")
zero_bf <- comparisons_to_plot %>% filter(bf == 0)
if(!identical(unique(zero_bf$method), "super_jags")){
stop("Bad assumption")
}
nrow(zero_bf)
```

`## [1] 11`

`zero_bf$bf_prior`

```
## [1] 1.242613e-37 3.423453e-23 3.619076e-09 1.497207e-21 4.917385e-10
## [6] 2.431973e-14 2.176901e-20 2.514081e-53 1.366711e-13 8.921497e-07
## [11] 1.321580e-05
```

We’ll exclude those invalid Bayes factors from further consideration.

`comparisons_to_plot <- comparisons_to_plot %>% filter(bf > 0)`

We can now plot the results of the various Monte Carlo methods against the analytical results:

```
base_comparisons_plot <- function(comparisons, x, y, trans, breaks = waiver()) {
  comparisons %>%
    sample_frac() %>% # random reorder
    ggplot(aes(x = {{x}}, y = {{y}}, color = method, shape = method)) +
    geom_abline(slope = 1, intercept = 0, color = "black") +
    geom_point(size = 3, alpha = 0.8) +
    guides(color = guide_legend(override.aes = list(alpha = 1))) +
    scale_x_continuous(trans = trans, breaks = breaks) +
    scale_y_continuous(trans = trans, breaks = breaks)
}
base_comparisons_plot(comparisons_to_plot, bf_prior, bf, trans = "log",
                      breaks = c(1e-45, 1e-30, 1e-15, 1))
```

We immediately see that only the marginalized Stan model keeps high agreement for the very low Bayes factors (yay marginalization!).

We can also zoom in on the upper-right area, where we see pretty good agreement between all methods.

```
base_comparisons_plot(comparisons_to_plot %>% filter(bf_prior > 1e-3),
                      bf_prior, bf, trans = "log",
                      breaks = c(1e-3, 1e-2, 1e-1, 1, 10))
```

Following the method outlined by Schad et al. we also check the calibration of our Bayes factors. Since the model has an analytical solution, our simulations are going to be much cheaper than in actual practice and we can do a lot of them.

```
N_calibration <- 20 # How many values in a dataset
set.seed(5487654)
res_cal_list <- list()
for(i in 1:5000) {
  use_null <- (i %% 2) == 0
  if(use_null) {
    mu <- 0
  } else {
    mu <- rnorm(1, 0, 2)
  }
  y <- rnorm(N_calibration, mu, sd = 1)
  res_cal_list[[i]] <- bf_prior(y)
  res_cal_list[[i]]$null_true = use_null
}
calibration <- do.call(rbind, res_cal_list)
```

As a quick heuristic, reflecting some common usage, we will interpret BF > 3 as weak evidence and BF > 10 as strong evidence. If we do this, here is how our results look, based on whether the null is actually true:

```
calibration %>% group_by(null_true) %>%
  summarise(strong_null = mean(bf >= 10),
            weak_null = mean(bf >= 3 & bf < 10),
            no_evidence = mean(bf < 3 & bf > 1/3),
            weak_intercept = mean(bf <= 1/3 & bf > 0.01),
            strong_intercept = mean(bf <= 0.01),
            .groups = "drop") %>%
  table_format()
```

null_true | strong_null | weak_null | no_evidence | weak_intercept | strong_intercept |
---|---|---|---|---|---|
FALSE | 0 | 0.1364 | 0.0852 | 0.0924 | 0.686 |
TRUE | 0 | 0.8572 | 0.1328 | 0.0100 | 0.000 |

We see that in the case of observing 20 values, this simple heuristic makes a potentially non-intuitive trade-off: low rates of wrongly claiming support for the intercept model are balanced by low rates of correctly finding strong support for the null model, and by a somewhat large rate of finding weak evidence in favor of the null model even when the null model was not used to simulate the data.

This is just to illustrate the point made by Schad et al. that explicit calibration and decision analysis should be done in the context of a given experimental design and the utility/cost of actual decisions. But we can be safe in the knowledge that the poor null model is unlikely to suffer unjust rejections in this case.

Bayes factors are not very intuitive, but I hope that understanding that the same number can be seen either as a ratio of prior predictive densities or as arising from a larger model that takes the candidate models as components can help improve that intuition. In line with the (better, more thoroughly done) results of Schad et al., we also observe that computation of Bayes factors cannot be taken for granted and that simple heuristics to interpret Bayes factors may have non-obvious implications.

Now go read Workflow Techniques for the Robust Use of Bayes Factors!

**UPDATE:** I have a followup post discussing the connection between Bayes factors and cross-validation on the same examples.

Obviously, one could add a lot of `&nbsp;` entities into the source text, but that seemed tedious and prone to error. And it turns out one can do this automatically - with *Lua filters*.

The thing is that RMarkdown uses Pandoc to convert (non-R) markdown to the target formats. And Pandoc lets you manipulate an intermediate representation of the text using Lua - a tiny little language that is commonly used as a scripting language in video games. You can read more about Lua filters in the R Markdown cookbook and the Pandoc documentation. There is also a very nice and practical article by Ulrik Lyngs.

With this knowledge, we can write a filter that checks for spaces after single-letter words and after a certain abbreviations and replaces them with an appropriate representation of non-breaking space (currently supporting only HTML and Latex).

Lua syntax is probably slightly unfamiliar, but I hope you will be able to customize the script to suit your RMarkdown needs. Note that `--` marks comments.

So here are the contents of `non_breaking_policy.lua` on my computer:

```
-- Returns a non-breaking space in the appropriate output format
local function non_breaking_space()
  if FORMAT:match 'html' then
    return(pandoc.RawInline("html", "&nbsp;"))
  elseif FORMAT:match 'latex' then
    return(pandoc.RawInline("latex", "~"))
  else
    error("Unsupported format for non_breaking_policy.lua")
  end
end

-- Other strings to force a nbsp after
-- Should be all lowercase
local additional_strings = {
  ["(tj."] = true,
  ["tj."] = true,
  ["tzv."] = true
}

-- Should return true if spc is a space that should be replaced by a non-breaking
-- space. txt is the element before the space
local function require_non_breaking_space(txt, spc)
  return spc and spc.t == 'Space'
     and txt and txt.t == 'Str'
     and (txt.c:len() == 1 or additional_strings[pandoc.text.lower(txt.c)])
end

-- Iterate over a list of content elements and replace spaces as needed
function replace_spaces (content)
  for i = #content - 1, 1, -1 do
    if require_non_breaking_space(content[i], content[i + 1]) then
      content[i + 1] = non_breaking_space()
    end
  end
  return content
end

function replace_spaces_content (s)
  s.content = replace_spaces(s.content)
  return s
end

function replace_spaces_caption(s)
  s.caption = replace_spaces(s.caption)
  return s
end

-- In theory, we should be able to filter all inline text elements with:
-- return {{ Inlines = replace_spaces }}
-- But for some reason, I couldn't make it work, so explicitly listing elements
-- whose contents should be transformed.
return {{ Para = replace_spaces_content,
          Header = replace_spaces_content,
          LineBlock = replace_spaces_content,
          Plain = replace_spaces_content,
          Emph = replace_spaces_content,
          Caption = replace_spaces_content,
          Link = replace_spaces_content,
          Quoted = replace_spaces_content,
          SmallCaps = replace_spaces_content,
          Span = replace_spaces_content,
          Strikeout = replace_spaces_content,
          Strong = replace_spaces_content,
          Underline = replace_spaces_content,
          Image = replace_spaces_caption
       }}
```

The Lua filter can then be used in an RMarkdown file as:

```
---
output:
  html_document:
    pandoc_args: ["--lua-filter=non_breaking_policy.lua"]
---
```

Hope that’s useful to somebody.

- The Approximation - Big Picture
- Saddlepoint for Sum of NBs
- Implementing the Approximation in Stan
- A Simple Baseline
- Eyeballing Masses
- Evaluating Performance
- Summing up
- Saddlepoint Approximations for Other Families

I recently needed to find the distribution of the sum of non-identical but independent negative binomial (NB) random variables. Although for some special cases the sum is itself NB, an analytical solution is not feasible in the general case. However, it turns out there is a very handy tool called the "saddlepoint approximation" that is useful whenever you need densities of sums of arbitrary random variables. In this post I use the sum of NBs as a case study on how to derive your own approximations for basically any sum of independent random variables, show some tricks needed to get the approximation working in Stan, and evaluate it against simpler approximations. To give credit where credit is due, I was introduced to the saddlepoint method via a Cross Validated answer on the sum of Gamma variables.

Spoiler: it turns out the saddlepoint approximation is not that great for actual inference (at least for the cases I tested), but it is still a cool piece of math and I spent too much time researching it not to show you this whole post.

The saddlepoint approximation uses the cumulant-generating function (CGF) of a distribution to compute an approximate density at a given point. The neat part about CGFs is that the CGF of the sum of several variables is the sum of the individual CGFs! And CGFs are easy to come by, because the CGF is just the log of the moment-generating function and Wikipedia helpfully lists moment-generating functions for almost all distributions. Figuring out the CGF of almost any sum variable (including variables from different families) is thus relatively straightforward. The CGF can also easily be derived for general linear combinations of random variables.
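As a tiny check of that additivity (my own example, not from the post): for Poisson variables, \(K(t) = \lambda(e^t - 1)\), and since a sum of independent Poissons is again Poisson, the sum of the CGFs equals the CGF of the sum:

```python
import math

def K_pois(lam, t):
    # Cumulant-generating function of Poisson(lam): log E[exp(t X)]
    return lam * (math.exp(t) - 1.0)

lam1, lam2, t = 1.5, 2.5, 0.3
# CGF of the sum = sum of the CGFs = CGF of Poisson(lam1 + lam2)
assert math.isclose(K_pois(lam1, t) + K_pois(lam2, t), K_pois(lam1 + lam2, t))
```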

The actual method for approximating density \(f\) at point \(x\), given the cumulant-generating function \(K\), and its first and second derivatives (\(K^\prime,K^{\prime\prime}\)) is as follows:

- Find the saddlepoint \(s_x\) by solving:

\[ K^\prime(s_x) = x \]

Generally, there is no closed-form solution for \(s_x\), but since \(K\) is always convex, \(K^\prime\) is always increasing, making the equation a nice target for numerical solvers. Still, since a different solution is needed for each \(x\), finding \(s_x\) tends to be a computational bottleneck.

- Once we have \(s_x\), we can approximate

\[ f(x) \simeq \frac1{\sqrt{2\pi K''(s_x)}} \exp(K(s_x) - x s_x) \]

The nice thing about the saddlepoint approximation is that it can easily produce approximations for both discrete and continuous densities, and doesn't constrain the approximation to be normal (unlike the Laplace approximation). One thing to note is that the saddlepoint approximation in the form above does not necessarily integrate to 1, so a renormalization might be needed if you are interested in the actual density. But for use in Stan, an unnormalized density is all that's needed.
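To see the recipe in action on a case where everything has a closed form (my own example, not from the post), take a single Poisson(\(\lambda\)) variable: \(K(t) = \lambda(e^t - 1)\), so \(K'(s) = x\) gives \(s_x = \log(x/\lambda)\) and \(K''(s_x) = x\), and the saddlepoint mass reduces to a Stirling-type approximation of the Poisson pmf:

```python
import math

lam, x = 10.0, 10  # hypothetical example values

# Saddlepoint for Poisson(lam): K'(s) = lam * e^s = x  =>  s = log(x / lam)
s = math.log(x / lam)
K_s = lam * (math.exp(s) - 1.0)  # K(s)
Kdd_s = lam * math.exp(s)        # K''(s) = x

approx = math.exp(K_s - x * s) / math.sqrt(2 * math.pi * Kdd_s)
exact = math.exp(-lam) * lam ** x / math.factorial(x)

# The unnormalized saddlepoint mass is within about 1% of the exact pmf here
assert abs(approx / exact - 1.0) < 0.01
```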

The moment-generating function of the NB distribution parametrized by number of failures \(r\) and probability of success \(p\) is:

\[ M(t) = \left( \frac{1 - p}{1 - p e^t} \right)^r \]

So, taking the log and summing over \(n\) independent NB variables, the CGF of the sum of NBs is:

\[ K(t) = \sum_{i=1}^{n} r_i \left[ \log(1-p_i) - \log(1 - p_i e^t) \right] \]

We now transform to the more useful parametrization of NB via mean \(\mu\) and precision \(\phi\) (i.e. \(Var(X) = \mu + \frac{\mu^2}{\phi}\)), where we have:

\[ r_i = \phi_i \\ p_i = \frac{\mu_i}{\phi_i + \mu_i} \\ K(t) = \sum_{i=1}^{n} \phi_i \left[ \log \frac{\phi_i}{\phi_i + \mu_i} - \log \left(1 - \frac{\mu_i e^t}{\phi_i + \mu_i} \right) \right] = \\ =\sum_{i=1}^{n} \phi_i \left[ \log(\phi_i) - \log(\phi_i + \mu_i ( 1 - e^t)) \right] \]

Note that \(K(t)\) exists only when \(\forall i:\phi_i + \mu_i ( 1 - e^t) > 0\); this constrains \(t\) such that:

\[ \begin{align} \tag{*} \forall i : t &< \log \left(\frac{\phi_i}{\mu_i} + 1 \right) \end{align} \]

The first and second derivatives of \(K\) are:

\[ K^\prime (t) = \sum_{i=1}^{n} \frac{\phi_i \mu_i e^t}{\phi_i + \mu_i (1 - e^t)} \\ K^{\prime\prime} (t) = \sum_{i=1}^{n} \frac{\phi_i \mu_i (\phi_i + \mu_i) e^t}{(\phi_i + \mu_i (1 - e ^t))^2} \\ \]

It turns out that the saddlepoint \(s_x\) is not defined when \(x = 0\), since the numerator of \(K^\prime(t)\) is positive for all \(t\) and the denominator has to be positive for \(K\) to exist. But for this special case, the density can be computed easily, as \(f(0) = \prod_i P(X_i = 0) = \prod_i NB(0 | \mu_i,\phi_i)\). The non-existence of a saddlepoint solution at the boundary of the domain is actually a recurring theme, as the existence of a solution is guaranteed only for interior points, so it is useful to check for this when developing your own approximations.
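A quick sanity check on these derivatives (my own sketch, using the example values that appear later in the post): cumulants at \(t = 0\) must recover the mean and variance of the sum, i.e. \(K'(0) = \sum \mu_i\) and \(K''(0) = \sum \left( \mu_i + \mu_i^2/\phi_i \right)\):

```python
import math

mus = [800.0, 1600.0]  # example values from later in the post
phis = [10.0, 1.0]

def Kd(t):
    # First derivative of the CGF of the NB sum
    return sum(p * m * math.exp(t) / (p + m * (1.0 - math.exp(t)))
               for m, p in zip(mus, phis))

def Kdd(t):
    # Second derivative of the CGF of the NB sum
    return sum(p * m * (p + m) * math.exp(t) / (p + m * (1.0 - math.exp(t))) ** 2
               for m, p in zip(mus, phis))

mean = sum(mus)
var = sum(m + m * m / p for m, p in zip(mus, phis))
assert math.isclose(Kd(0.0), mean)   # first cumulant = mean of the sum
assert math.isclose(Kdd(0.0), var)   # second cumulant = variance of the sum
```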

This has all been a nice math exercise, but how can we translate it into a piece of code we could use? The only problematic part is solving for \(s_x\); once we have it, the rest is simple math that Stan will digest easily. Luckily, Stan has the built-in `algebra_solver` that can solve equations AND provide derivatives of the solution wrt. parameters. There is only a minor problem - we have an upper bound on \(s_x\) from the equation \((*)\) and `algebra_solver` turns out not to work when there are boundaries (even when initialized within the boundaries). Instead we use the same method Stan uses for bounds on parameters and solve for an unbounded \(y_x\) where:

\[ s_x = \min_i{\log \left(\frac{\phi_i}{\mu_i} + 1 \right)} - e^{y_x} \]

So let us get our hands dirty and show some code, starting with how to write the saddlepoint equation in a way that the `algebra_solver` can handle. Since \(K^\prime\) is always positive, we transform the equation to the log scale - partly because we might have some big \(\sum\mu_i\) out there and partly because it seems nice - I didn't test the non-log version. So the equation we are actually solving for \(s_x\) is:

\[ \log \sum_{i=1}^{n} \exp \left( \log\phi_i + \log \mu_i + s_x - \log(\phi_i + \mu_i - \mu_i \exp(s_x)) \right) - \log x = 0 \]

Translated into Stan we get:

```
//Compute upper bound on s - there is a singularity at this point
real max_s(vector mus, vector phis) {
  return min(log(phis ./ mus + 1));
}

//Transform from unbounded y to s upper-bounded by max_s
vector s_transform(vector y, vector mus, vector phis) {
  return -exp(y) + max_s(mus, phis);
}

vector nb_sum_log_Kd_eq(vector y, vector theta, real[] x_r, int[] x_i) {
  int G = rows(theta) / 2;
  vector[G] mus = theta[1:G];
  vector[G] phis = theta[(G + 1) : (2 * G)];
  real s = s_transform(y, mus, phis)[1];
  real sum_y = x_i[1];
  vector[G] log_phis_mus = log(phis) + log(mus);
  vector[G] phis_mus = phis + mus;
  real value = log_sum_exp(log_phis_mus + s - log(phis_mus - mus * exp(s))) - log(sum_y);
  return to_vector({value});
}
```

Above, `y` are the unconstrained unknowns, which we transform via `s_transform` to the constrained space. Further, we extract \(\mu_i\) and \(\phi_i\) from `theta`, which can be parameters, while `x_i` contains the observed sums (data). Since we have no real-number data, `x_r` is ignored. The `algebra_solver` will try to find `y` such that `value` is 0, which is exactly when `s` is the solution to the saddlepoint equation.
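For intuition, the same solve can be sketched outside Stan; here is a minimal Python version (my own illustration, with plain bisection standing in for `algebra_solver`, using example \(\mu_i, \phi_i\) values from later in the post and a hypothetical observed sum):

```python
import math

mus = [800.0, 1600.0]  # example mus/phis from the post
phis = [10.0, 1.0]
x = 3000.0             # a hypothetical observed sum

# Upper bound on s (the singularity from equation (*))
s_max = min(math.log(p / m + 1.0) for m, p in zip(mus, phis))

def s_of_y(y):
    # Same transform as s_transform in the Stan code
    return s_max - math.exp(y)

def Kd(s):
    # K'(s) for the sum of NBs
    return sum(p * m * math.exp(s) / (p + m * (1.0 - math.exp(s)))
               for m, p in zip(mus, phis))

def equation(y):
    # log K'(s(y)) - log x, the log-scale saddlepoint equation
    return math.log(Kd(s_of_y(y))) - math.log(x)

# The equation is monotone (decreasing) in y, so plain bisection suffices
lo, hi = -10.0, 6.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if equation(mid) > 0.0:
        lo = mid
    else:
        hi = mid
s_x = s_of_y(0.5 * (lo + hi))

# At the saddlepoint, K'(s_x) recovers x
assert abs(Kd(s_x) / x - 1.0) < 1e-6
```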

We use the `nb_sum_log_Kd_eq` function to compute the actual saddlepoint density:

```
real neg_binomial_sum_saddlepoint_lpmf(int[] sum_y, vector mus, vector phis, real[] dummy_x_r) {
  int N = size(sum_y);
  int G = rows(mus);

  // Solve the saddlepoint equation
  vector[2 * G] solver_params = append_row(mus, phis);
  vector[N] s_vec_raw;
  vector[1] solver_guess = to_vector({0});
  for(n in 1:N) {
    if(sum_y[n] != 0) {
      //Saddlepoint is defined only for non-zero values
      s_vec_raw[n] = algebra_solver(nb_sum_log_Kd_eq, solver_guess, solver_params, dummy_x_r, {sum_y[n]})[1];
    } else {
      //This will be ignored, but needed to pass to s_transform without problems
      s_vec_raw[n] = 0;
    }
  }

  {
    vector[N] s = s_transform(s_vec_raw, mus, phis);
    //Calculate the saddlepoint mass
    vector[N] lpmf;
    vector[G] log_mus = log(mus);
    for(n in 1:N) {
      if(sum_y[n] != 0) {
        vector[G] log_denominator_s = log(phis + mus - mus * exp(s[n]));
        real K_s = sum(phis .* (log(phis) - log_denominator_s));
        real log_Kdd_s = log_sum_exp(log(phis) + log_mus + log(phis + mus) + s[n] - 2 * log_denominator_s);
        lpmf[n] = -0.5 * (log(2 * pi()) + log_Kdd_s) + K_s - s[n] * sum_y[n];
      } else {
        //For zero values, the probability is simply that of all NBs giving 0
        lpmf[n] = neg_binomial_2_lpmf(rep_array(0, G) | mus, phis);
      }
    }
    return sum(lpmf);
  }
}
```

The above shows how the `algebra_solver` is called - we combine \(\mu_i\) and \(\phi_i\) as params and pass a guess (0 works great, so we don't need to worry about details). The only weird part is `dummy_x_r` - I want it to be just an empty array, but it has to be of type `real` and has to be data. And I didn't find a way to make the compiler understand that, unless I pass `dummy_x_r` from outside as in:

```
transformed data {
  real dummy_x_r[0];
}
...
model {
  sums ~ neg_binomial_sum_lpmf(mus, phis, dummy_x_r);
}
```

To assess how useful the saddlepoint approximation is in practice, we'll compare it to a straightforward application of the method of moments. This is just a fancy name for choosing a distribution family and setting its parameters so that the mean, variance (and possibly higher moments) match those of the desired distribution. In the case of NBs, when \(Y_i \sim NB(\mu_i, \phi_i)\) then

\[ E \left(\sum Y_i \right) = \sum \mu_i \\ Var \left(\sum Y_i \right) = \sum \left( \mu_i + \frac{\mu_i^2}{\phi_i} \right) \]

This holds simply because the mean is always additive and, for independent variables, so is the variance. Maybe the sum of NBs isn't that different from a NB distribution, so let's approximate

\[ \sum Y_i \approx NB(\bar\mu, \bar\phi) \]

Solving for \(\bar\mu\) and \(\bar\phi\) by matching the mean and variance of the approximate distribution gives:

\[ \bar \mu = \sum \mu_i \\ \bar \phi = \frac{ \left(\sum \mu_i \right)^2 }{\sum \frac{\mu_i^2}{\phi_i}} \]

This can be implemented very directly in Stan as:

```
real neg_binomial_sum_moments_lpmf(int[] sum_y, vector mus, vector phis) {
  real mu_approx = sum(mus);
  real phi_approx = square(mu_approx) / sum(square(mus) ./ phis);
  return neg_binomial_2_lpmf(sum_y | mu_approx, phi_approx);
}
```
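As a quick check that the matched parameters behave as advertised (my own sketch, reusing example values from this post), \(NB(\bar\mu, \bar\phi)\) should reproduce the mean and variance of the sum exactly:

```python
import math

mus = [50.0, 100.0, 1300.0, 2000.0]  # example values from the post
phis = [10.0, 10.0, 10.0, 10.0]

mu_bar = sum(mus)
phi_bar = mu_bar ** 2 / sum(m * m / p for m, p in zip(mus, phis))

# Mean and variance of the sum of the independent NBs
target_mean = sum(mus)
target_var = sum(m + m * m / p for m, p in zip(mus, phis))

# NB(mu_bar, phi_bar) matches both moments exactly
assert math.isclose(mu_bar, target_mean)
assert math.isclose(mu_bar + mu_bar ** 2 / phi_bar, target_var)
```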

As a first look, we will see how well both approximations match the empirical mass function - we simulate a lot of sums of NBs, bin the range of observed values and compute the empirical mass as the proportion of samples that fall into each bin. For the approximations, we sum the mass over all values belonging to each bin.

The saddlepoint approximation improves notably over moments when the Fano factors of the summed variables are vastly different and we do not sum a large number of values, below we show mass and log mass for the case when \(\mu = \{800, 1600 \}\) and \(\phi = \{10, 1 \}\):

It is visible that the saddlepoint mass tracks the empirical mass very tightly both in the bulk and in the tail (visible better on the log mass) - note that the tail of the empirical log mass is jittery due to low number of samples in the tail.

On the other hand, when we sum a lot of variables which are not very different and/or when the \(\phi_i\) are large, the sum becomes normal-ish and both approximations work well - let us for example look at the case when \(\mu = \{50, 100, 1300, 2000 \}\) and \(\phi = \{10, 10, 10, 10 \}\):

This gives us some intuition where to expect differences.

We evaluate the model using Simulation-Based Calibration (SBC). The main idea is that when I generate data exactly the way the model assumes, then for any \(c\), the \(c\%\) posterior interval should contain the true value of an unobserved parameter in exactly \(c\%\) of the cases. In other words, the quantile at which the true value is found in the posterior distribution should be uniformly distributed. There are some caveats to this; read the paper for details.
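To make the idea concrete, here is a minimal SBC sketch in Python for a toy conjugate model (my own example, not from the post: prior \(\mu \sim N(0,1)\), observations \(y_i \sim N(\mu,1)\), so the posterior is known in closed form); the rank of the true \(\mu\) among posterior draws should be uniform:

```python
import random
import statistics

random.seed(1234)
n_sims, n_obs, n_post = 1000, 5, 99

ranks = []
for _ in range(n_sims):
    mu_true = random.gauss(0.0, 1.0)                    # draw from the prior
    y = [random.gauss(mu_true, 1.0) for _ in range(n_obs)]
    # Conjugate posterior for prior N(0,1) and unit observation noise:
    # N(sum(y) / (n + 1), 1 / (n + 1))
    post_mean = sum(y) / (n_obs + 1)
    post_sd = (1.0 / (n_obs + 1)) ** 0.5
    post_draws = [random.gauss(post_mean, post_sd) for _ in range(n_post)]
    ranks.append(sum(d < mu_true for d in post_draws))  # rank in 0..n_post

# With a correct model + sampler, ranks are uniform on 0..99,
# so the mean rank should be close to 49.5
mean_rank = statistics.mean(ranks)
assert 45 < mean_rank < 54
```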

I am using my own implementation of SBC, which lives in my not very well documented, likely-never-on-CRAN package `rstanmodeldev`. We run 500 simulations for each of the test cases. If you want to see under the hood, the code for this post is available at the GitHub repo of this blog.

The first test case I will use is that I observe the sum of \(G+1\) variables where I know \(\mu_i\) and \(\phi_i\) for \(i \in {1 .. G}\), while \(\mu_{G+1}\) and \(\phi_{G+1}\) are unknown and have to be inferred from \(N\) observations of the sum.

In all cases, both observed and unobserved \(\phi_i\) are drawn following the advice of Dan Simpson, i.e.:

\[ \phi_{raw} \sim HalfN(0, 1) \\ \phi = \frac{1}{\sqrt{\phi_{raw}}} \]

This is how the model looks in Stan (`test_sum_nb.stan`):

```
#include /sum_nb_functions.stan

data {
  int<lower=1> N;
  int<lower=0> sums[N];
  int<lower=1> G;
  vector[G] mus;
  vector[G] phis;
  //0 - saddlepoint, 1 - method of moments
  int<lower = 0, upper = 1> method;
  real mu_prior_mean;
  real<lower = 0> mu_prior_sd;
}

transformed data {
  real dummy_x_r[0];
}

parameters {
  real log_extra_mu_raw;
  real<lower=0> extra_phi_raw;
}

transformed parameters {
  real<lower=0> extra_mu = exp(log_extra_mu_raw * mu_prior_sd + mu_prior_mean);
  real<lower=0> extra_phi = inv(sqrt(extra_phi_raw));
}

model {
  vector[G + 1] all_mus = append_row(mus, to_vector({extra_mu}));
  vector[G + 1] all_phis = append_row(phis, to_vector({extra_phi}));
  if(method == 0) {
    sums ~ neg_binomial_sum_saddlepoint_lpmf(all_mus, all_phis, dummy_x_r);
  } else {
    sums ~ neg_binomial_sum_moments_lpmf(all_mus, all_phis);
  }
  log_extra_mu_raw ~ normal(0, 1);
  extra_phi_raw ~ normal(0, 1);
}
```

Most notably, the way the sum of NBs is implemented is given as data. The `sum_nb_functions.stan` include contains the functions shown above.

And this is an R function to generate simulated data - given the parameters of the observed data, it returns a function that on each call generates both `true` and `observed` data in a format that matches the Stan model:

```
generator <- function(G, N, method = "saddlepoint", observed_mean_mus, observed_sd_mus,
                      mu_prior_mean, mu_prior_sd) {
  if(method == "saddlepoint") {
    method_id = 0
  } else if (method == "moments") {
    method_id = 1
  } else {
    stop("Invalid method")
  }

  function() {
    all_mus <- rlnorm(G + 1, observed_mean_mus, observed_sd_mus)
    all_mus[G + 1] <- rlnorm(1, mu_prior_mean, mu_prior_sd)
    all_phis <- 1 / sqrt(abs(rnorm(G + 1)))
    sums <- array(-1, N)
    for(n in 1:N) {
      sums[n] <- sum(rnbinom(G + 1, mu = all_mus, size = all_phis))
    }
    list(
      observed = list(
        N = N,
        sums = sums,
        G = G,
        method = method_id,
        mus = array(all_mus[1:G], G),
        phis = array(all_phis[1:G], G),
        mu_prior_mean = mu_prior_mean,
        mu_prior_sd = mu_prior_sd
      ),
      true = list(
        extra_mu = all_mus[G + 1],
        extra_phi = all_phis[G + 1]
      )
    )
  }
}
```

Here we test a sum of two NBs - the means of both the observed and the unobserved NB are chosen randomly from LogNormal(5,3). We observe 10 sums in each run.

First, let’s look at diagnostics for the saddlepoint approximation:

has_divergence | has_treedepth | has_low_bfmi | median_total_time | low_neff | high_Rhat |
---|---|---|---|---|---|
0.04 | 0.004 | 0 | 50.0525 | 0 | 0 |

All the columns except for `median_total_time` represent the proportion of fits that had problems with divergences/treedepth/low n_eff etc. We see that a small number of runs ended with divergences. This is not great, but we will ignore it for now. The `n_eff` and `Rhat` diagnostics are okay. We also note that the model is quite slow - 50 seconds for just 10 observations is a lot.

Let’s look at the SBC histogram at two resolutions:

Here we would like to see a uniform distribution. The gray area is a rough 99% confidence interval, so very few bars should actually be outside it. While the histogram for \(\mu_{G+1}\) looks OK, the consistent trend and several outliers for \(\phi_{G+1}\) indicate that the approximation has some problems and consistently underestimates the true value.

Finally we can look at a scatter plot of true value vs. posterior median:

The blue line indicates a perfect match (true value = posterior median). As in the above plot, we see that \(\mu_{G+1}\) is inferred quite precisely, especially for larger true values, while the results for \(\phi_{G+1}\) are diffuse, often dominated by the priors (the prior density peaks at around 1.7), and have a slight tendency to fall below the perfect-prediction line. We also see that low true values of \(\mu_{G+1}\) tend to be overestimated - this is not unexpected, as when the observed \(\mu\) is large and the unobserved one is small, it is hard to infer its exact value and the posterior is largely influenced by the prior (which has a large mean).

We can now do the same for the method of moments approximation, starting with the diagnostics:

has_divergence | has_treedepth | has_low_bfmi | median_total_time | low_neff | high_Rhat |
---|---|---|---|---|---|
0.02 | 0 | 0 | 0.6775 | 0.016 | 0.016 |

We see a small number of divergences plus some low n_eff and high Rhat (which usually go hand in hand). This is comparable to the saddlepoint case.

The histogram:

The histograms look very slightly worse than for the saddlepoint approximation - although there is no consistent trend, more bars are outside the confidence interval or close to its border, indicating some issues, although I honestly don't really understand what is going on.

And the scatterplot, which looks quite similar to the saddlepoint version:

Further, we can check the case where there are 20 known variables with low means and one unknown NB with a large mean - we want the unobserved mean to have a notable influence on the total outcome, hence we choose it to be larger. In particular, the observed means are drawn from LogNormal(2,1) and the mean to be inferred is drawn from LogNormal(5,3).

Looking at the statistics, we see only very few divergences, but a quite large median time:

has_divergence | has_treedepth | has_low_bfmi | median_total_time | low_neff | high_Rhat |
---|---|---|---|---|---|
0.002 | 0 | 0 | 212.1805 | 0 | 0 |

The histogram:

We see that especially for \(\phi_{G+1}\) the results are discouraging, with the true value frequently being in the low quantiles of the posterior.

The scatterplot is then quite similar to the previous cases.

The statistics for moments show a short running time but a larger number of convergence issues:

has_divergence | has_treedepth | has_low_bfmi | median_total_time | low_neff | high_Rhat |
---|---|---|---|---|---|
0.016 | 0 | 0.008 | 1.0095 | 0.162 | 0.16 |

The histograms:

The histograms hint at consistent underestimation of \(\mu_{G+1}\) and overestimation of \(\phi_{G+1}\), problematic especially for \(\phi_{G+1}\).

Once again the scatter is similar; the only interesting feature is the few outliers for \(\mu_{G+1}\) where the true value is large but the posterior median is very small. Those likely correspond to the divergent runs, but they cannot account for the full skew of the SBC histograms - that is more likely caused by the string of underestimated points just below the blue line in the top right.

The first model is not very useful when `G` is large, because the posterior gets dominated by the prior. To better test what happens with large `G`, we instead use a single parameter to define all \(\mu_i\) and \(\phi_i\) as a geometric series, i.e. \(\mu_i = \mu_{base} k^{(i - 1)}\) where \(k\) is known while \(\mu_{base}\) is the unknown parameter; similarly for \(\phi_i\). The Stan code is:

```
#include /sum_nb_functions.stan

data {
  int<lower=1> N;
  int<lower=0> sums[N];
  int<lower=1> G;
  //0 - saddlepoint, 1 - method of moments
  int<lower = 0, upper = 1> method;
  real mu_prior_mean;
  real<lower = 0> mu_prior_sd;
  real<lower=0> mu_series_coeff;
  real<lower=0> phi_series_coeff;
}

transformed data {
  real dummy_x_r[0];
  vector[G] mu_coeffs;
  vector[G] phi_coeffs;
  mu_coeffs[1] = 1;
  phi_coeffs[1] = 1;
  for(g in 2:G) {
    mu_coeffs[g] = mu_coeffs[g - 1] * mu_series_coeff;
    phi_coeffs[g] = phi_coeffs[g - 1] * phi_series_coeff;
  }
}

parameters {
  real log_mu_raw;
  real<lower=0> phi_raw;
}

transformed parameters {
  real<lower=0> mu = exp(log_mu_raw * mu_prior_sd + mu_prior_mean);
  real<lower=0> phi = inv(sqrt(phi_raw));
}

model {
  vector[G] all_mus = mu * mu_coeffs;
  vector[G] all_phis = phi * phi_coeffs;
  if(method == 0) {
    sums ~ neg_binomial_sum_saddlepoint_lpmf(all_mus, all_phis, dummy_x_r);
  } else {
    sums ~ neg_binomial_sum_moments_lpmf(all_mus, all_phis);
  }
  log_mu_raw ~ normal(0, 1);
  phi_raw ~ normal(0, 1);
}
```

The R code for simulation is then:

```
generator_series <- function(G, N, method = "saddlepoint", mu_prior_mean, mu_prior_sd,
                             mu_series_coeff, phi_series_coeff) {
  if(method == "saddlepoint") {
    method_id = 0
  } else if (method == "moments") {
    method_id = 1
  } else {
    stop("Invalid method")
  }

  function() {
    mu <- rlnorm(1, mu_prior_mean, mu_prior_sd)
    phi <- 1 / sqrt(abs(rnorm(1)))
    all_mus <- mu * mu_series_coeff ^ (0:(G - 1))
    all_phis <- phi * phi_series_coeff ^ (0:(G - 1))
    sums <- array(-1, N)
    for(n in 1:N) {
      sums[n] <- sum(rnbinom(G, mu = all_mus, size = all_phis))
    }
    list(
      observed = list(
        N = N,
        sums = sums,
        G = G,
        method = method_id,
        mu_prior_mean = mu_prior_mean,
        mu_prior_sd = mu_prior_sd,
        mu_series_coeff = mu_series_coeff,
        phi_series_coeff = phi_series_coeff
      ),
      true = list(
        mu = mu,
        phi = phi
      )
    )
  }
}
```

In the following we draw \(\mu_{base}\) from LogNormal(8, 3) and use `mu_series_coeff = 0.75` and `phi_series_coeff = 0.9`.

Once again let’s look at the diagnostics for saddlepoint approximation:

has_divergence | has_treedepth | has_low_bfmi | median_total_time | low_neff | high_Rhat |
---|---|---|---|---|---|
0.054 | 0 | 0 | 155.9085 | 0 | 0 |

We see quite a few divergences, but I didn’t investigate them in detail. The SBC histograms follow:

The histograms hint at some problems for \(\mu\).

The scatterplot shows that the estimation is quite reasonable for both \(\mu\) and \(\phi\) - definitely better than the previous model, as we got rid of the cases where the data do not identify the true values well.

The diagnostics and plots for the method of moments are:

has_divergence | has_treedepth | has_low_bfmi | median_total_time | low_neff | high_Rhat |
---|---|---|---|---|---|
0.042 | 0.006 | 0.004 | 0.7445 | 0.004 | 0.002 |

We see a bunch of problems, comparable to the saddlepoint version. Let’s look at the SBC histograms:

And those are surprisingly nice, showing no clear trend or outliers!

The scatterplot is very similar to the saddlepoint case.

We see that in the regimes we tested, the saddlepoint approximation for the sum of negative binomials provides somewhat better inferences for small numbers of variables, at the cost of much increased computation time. For sums of large numbers of variables, it may even be worse than the method of moments. So it is probably not very practical unless you have only a few variables and need that extra bit of precision. But it is a neat mathematical trick and of interest in its own right. It is also possible that for some low-mean regimes the difference is bigger.

If you want to use the saddlepoint approximation for variables other than NB, but don't want to do the math on your own, there are some worked out on the Internet:

- Sum of Gamma variables: Answer on Cross Validated
- Sum of binomials: Liu & Quertermous: Approximating the Sum of Independent Non-Identical Binomial Random Variables

Thanks for reading!

Spoiler: this post is mostly not about the Amrhein paper.

I studied computer science, which included a mandatory course on probability and statistics. I vaguely remember being OK with probability, shaking my head at all the effort it took to show that averaging numbers makes sense, and slightly wondering why we care about "unbiased estimators". The lectures on basic hypothesis testing felt esoteric and I couldn't wrap my head around them. As the examination neared, I did other stuff students do and ended up not even reading the hypothesis-testing part of my notes. Of the four exam terms that year, the one I attended was the only one without any questions on hypothesis testing. I barely passed and quickly forgot. Later, I took a machine learning course and a neural networks course, but I didn't notice any connection to the statistics stuff.

Around 2012, when finishing my masters, I realized I had some data on the time various AI agents took to solve a task and I needed to analyze them. My advisor recommended I read "Statistics Explained: An Introductory Guide for Life Scientists" by McKillup. This felt like a revelation. Suddenly some of the things I had heard about before made sense and I could understand why you would use a t-test.

But I didn't really have two groups and couldn't figure out how to extend the ideas from McKillup to my tasks, so I got a consultation with a statistician and first heard of "linear models". I didn't understand what I was doing, but by trial, error and Googling I managed to call `glm` with semi-sensible arguments and I got a bunch of p-values, lots of them even below 0.05! Being a confident guy, not understanding what a linear model is didn't prevent me from writing things like:

… we consider lack of rigorous statistical analysis in game AI experiments an issue in the field.

I was proud of myself; I even showed a set of "nice" tables with effect sizes, p-values and highlighted significance:

*Effect size on top, p-value below. Doesn’t that warm your heart?*

Looking at the paper now, there is at least a faint relief that I did correct for multiple comparisons.

I didn’t need a lot of statistics for a while, but then, around 2014, a colleague sent me a link to The new statistics: why and how by Cumming. This felt like a revelation. I started hating p-values and thought everybody should preregister everything. I didn’t do a lot of empirical stuff at the time, but in 2016, I wanted to compare how people felt playing against several types of AI in a game. And I wanted to do it The Right Way™. I used aspredicted.org to preregister my study (just hours before I started collecting data). I made all the code and data available. I plotted individual data points. I averaged several Likert responses and used confidence intervals of multiple pairwise t-tests on those averages. Nothing could stop me, I knew how statistics should be done.

After finishing my PhD in mid-2016, I changed fields and started doing bioinformatics. Soon, I noticed that statistics might play a vital role in my new job, but everybody in the lab admitted they didn’t really know anything about it. I googled the weird phrases that appeared in the literature I was digesting and started reading random stuff that popped up. I realized there is a connection between least squares and the normal distribution. I noticed some people tend to use the word “Bayes” a lot and say nice things about it. My colleague from earlier told me that New statistics was already old news and all the cool kids use Bayes factors now. Then I stumbled on Andrew Gelman’s blog.

This felt like a revelation. In retrospect, this process of repeated revelations probably should have raised a red flag or two.

*Image courtesy of XKCD: https://xkcd.com/1027/*

Anyway, I bought in with all the enthusiasm of a new convert. I read several years of Andrew’s blog in full and started to dislike Brian Wansink and Amy Cuddy. By mid-2017, I had attempted rewriting the gene expression model I used into Stan and failed terribly. I tried to push Bayes in metatranscriptomics and failed. I tried analysing single-cell RNA-seq with Bayes and failed. I filed my first pull request against Stan’s codebase and later wrote my first actually useful model. It is telling that to write this post I’ve cannibalized a stub I pushed to GitHub a year ago, which had the title “Go Bayes: Why and how”.

I learned new stuff and was amazed: it made sense! I finally took time to understand this “linear model” thing and became fluent in which distributions there are and what a likelihood is. I read The 100% CI, Data colada and others. I started to have strong opinions on preprints and academic publishing. I was on a fast trajectory to become an obnoxious critic of statistics in other people’s work and a loyal follower of the various heralds of scientific reform.

In some sense, Twitter has prevented that. I can’t exactly put my finger on it, but most likely, it was following Danielle Navarro that showed me a different social bubble. Some comments on Andrew’s blog were also certainly an influence. Whatever the tipping point, words like “kindness” and “community” entered my thinking about science. In some sense this was an easy shift as I have long experienced the power of being kind first-hand through my involvement with the Czech scout movement. I am a bit ashamed it took me so long to realize it could apply to science as well.

From a different angle, I noticed Daniël Lakens’s blog and read up on Deborah Mayo’s work. And they made sense - who would have thought that frequentists weren’t all evil monsters delaying the inevitable victory of the One True Gospel of Reverend Thomas Bayes? Maybe I still have a lot to learn about methods and statistics and everything around that?

I worked on improving my knowledge of philosophy of science and quickly started feeling like it’s all a bit more complicated than I thought (surprise). These days, I work hard to include as many people as I can in constructive discussions about how to improve our methods. I am trying to find ways to make statistics understandable, to convey what models do in precise, but approachable language. I’ve stopped shaming people for their code or stats. I’ve attempted doing constructive post-publication peer review. I try to be a good mentor to anyone who asks for guidance I can provide, but not give unsolicited advice. I aim to talk less, listen more and more readily say “I don’t know”. And it’s hard, harder than the math part (which is still very hard). But it needs to be done - my current understanding is that people, communities and institutions are the most important part of any effort to improve science.

Around this point, it would be useful to clarify what this story is about. Is this a cautionary tale, telling how manifestos and rallying cries can make people overzealous? Is this a story about my failure to see the bigger picture? Or is this an encouragement that strongly opinionated pieces can make people grow and continuously improve? Frankly, I am not sure.

It is true that I got to the point where I am through calls to action. I was wrong time and again, but every time I learned new stuff and became at least a little less wrong - I finally came far enough to notice there might not be an end to this journey, no final and definitive answers, just a life spent trying hard to improve what we can.

And it is also true that I’ve been foolish many times in the past. I was quick to absorb new statistical religions, but I was wrong and sometimes annoying and likely also causing harm. And that’s something that makes me wary of the calls for reform, even though I think the Amrhein paper is well worded and better than most other rally cries I’ve read before. I am certainly sympathetic to the claim that a part of the problem with science is not that people don’t listen to statisticians, but that they listen all too well.

I don’t have answers, but I currently believe the way we treat each other and the way we frame our discussions is in at least as deep a crisis as our statistics. This journey is not over and many obstacles still lie ahead.

However, I am certain that a lot of gratitude is due. I want to give my sincere thanks to people that helped me on my journey towards statistics and especially to people who work hard on improving what we do. To those that unglamorously solve one small problem after another. To those that sincerely try to improve their methods regardless of skill level. To those that care for each other and build islands of trust and kindness. Thank you very much.

I thank Danielle Navarro for alerting me to how much kindness matters in science and generally being a role model in writing and thinking.

I thank Dan Simpson for his relentless fear of being wrong and for being his fabulous self.

I thank Michael Betancourt for the deep insights into the math that I struggle to digest.

I thank Hadley Wickham for the idea of “Teaching Safe-Stats, Not Statistical Abstinence” and - obviously - for tidyverse.

I thank Berna Devezer for highlighting areas of philosophy of science that I could’ve missed otherwise.

I thank the Stan community for all the feedback and help in learning and also for the opportunity to give back.

I thank everyone who I forgot to include here - I am sorry that my memory is so weak. You are all great and I am in awe and feel privileged to be able to learn from you.

How would you write unit tests for data analysis? I feel it will be both tricky and unnecessary. […] On the other hand, data analysts often do tests in an informal way, too. As they explore the data, they may draw plots or create summary tables, in which they may be able to discover problems (e.g., wrong categories, outliers, and so on).

This reads as if there is no room for automated tests in markdown/notebooks. I respectfully disagree: automated tests and checks are IMHO vital for high-quality notebooks - whether we still call them unit tests or something else is beside the point.

Let me give you an example from a recent analysis I did. I needed to connect some data containing addresses to city populations from the census. The best way to link the two seemed to be by post code and municipality name: neither corresponds to unique rows in the census data on its own, but the combination seemed sufficient. So I wrote a simple check (note that all code is shortened for clarity and not tested to actually work):

```
non_unique_rows <- census_data %>%
  group_by(postcode, municipality) %>%
  summarise(count = length(id)) %>%
  filter(count > 1)

if (nrow(non_unique_rows) > 0) {
  stop("Municipality and postcode do not identify census data")
}
```

And this check actually fired, as the data was not - contrary to my belief - unique. It turned out, however, that the data under analysis never referenced those non-unique rows, and the check thus became:

```
non_identified_data <- main_data %>%
  select(municipality, postcode) %>%
  semi_join(non_unique_rows,
            by = c("postcode" = "postcode",
                   "municipality" = "municipality"))

if (nrow(non_identified_data) > 0) {
  stop("Data couldn't be mapped to census data")
}
```

I could have run the check just once from the console, but storing it in the notebook has some core advantages:

- The test documents some of the intent and assumptions you make about your analysis and/or data.
- The test code can be found and reused later.
- When you change some preprocessing steps or update your data to a newer version you will be notified of problems.

The last advantage became apparent when I had a complicated join that was nevertheless supposed to give exactly one match for each row in the data, so I wrote:

```
main_data_augmented <- main_data %>%
  inner_join(...) %>%
  ... lots of complex joining ...

if (nrow(main_data_augmented) != nrow(main_data)) {
  stop("Unexpected number of rows after join.")
}
```

This worked nicely. Later, I changed how I prepare the tables that go into the join, thinking it couldn’t break anything, and the above check fired, because (obviously) I made a mistake. If there was no check, the downstream analysis would have run without complaining, but some rows from `main_data` would actually have been duplicated, as they now had more matches.

Since you should do such checks anyway to ensure your analysis is correct, storing them in the markdown file is very little additional effort and can save you a lot of trouble. So please, write tests and checks within your notebooks!
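To make reuse easier, such checks can be wrapped in a tiny helper. The sketch below is my own (the name `check_empty` is made up, not from any package), assuming the common pattern “this data frame of violations should be empty”:

```r
# A tiny helper for notebook checks: stop with a clear message when a
# data frame of rule violations is non-empty. `check_empty` is a
# hypothetical name, not from any package.
check_empty <- function(violations, message) {
  if (nrow(violations) > 0) {
    stop(message, " (", nrow(violations), " offending rows)", call. = FALSE)
  }
  invisible(violations)
}

# Usage, mirroring the first check above (dplyr pipeline sketched):
# census_data %>%
#   group_by(postcode, municipality) %>%
#   filter(n() > 1) %>%
#   check_empty("Municipality and postcode do not identify census data")
```

Because the helper returns its input invisibly, it can sit in the middle of a pipeline without changing the result.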

I show how the t-test can fail, inspired by Andrew Gelman’s take on power = .06, and how DESeq2 (used to determine differentially expressed genes) does a good job of mitigating false positives at the cost of increased false negatives.

The slides are here or check the GitHub repo with simulation code.

*This is the second post in my series on taming divergences in Stan models, see the first post in the series for a general introduction. Also see the guide to Stan warnings.*

**Standard caveat:** *I am not an expert on Stan, I consider myself just an advanced user who likes to explain things. Please point out any errors, things that contradict your experience or anything else you do not trust.*

In a strict sense, non-identifiability means that two values of the parameters result in the same probability distribution of the observed data. The term is also sometimes used to cover situations where there is no unique local maximum of the posterior density - either because there are multiple separate maxima, or because there is a ridge/plateau where a set of points has the same posterior density (those may or may not be identifiable in the strict sense).
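As a minimal illustration (my own toy example): if two location parameters enter the likelihood only through their sum,

\[ y_i \sim N(\theta_1 + \theta_2, 1) \]

then every pair \((\theta_1, \theta_2)\) with the same sum yields exactly the same distribution of the data, so the model is non-identifiable in the strict sense.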

On the Stan forums the term seems to be used in an even broader sense, also covering cases where the maximum of the posterior density is in a region that is approximately flat. This often happens when the posterior is dominated by the prior and the likelihood provides little information about the parameters. If this is the case, it is sometimes said that the model is *weakly identified*. A weakly identified model may become non-identified in the strict sense if a prior is not specified for all parameters. This is just another reason to specify proper priors for everything.

Problems with identifiability are just one class of issues that are signalled in Stan by divergences and/or other diagnostics (max treedepth, low BFMI, low `n_eff`, large `Rhat`); the first post in this series has a more extensive list of other possible causes. Note also that unlike Stan, most statistical/ML software won’t complain when you try to fit non-identifiable models, even though it may lead to noticeably biased inferences.

In this post I will show a few different types of issues that result from limited identifiability. I will also try to show how to spot these problems in various visualisations. Remember that instead of creating the plots in code as we do here, you can use ShinyStan to explore many visualisations interactively.

We will start with some weakly identified non-linear regression models and move toward models that are hopelessly multimodal (have multiple local maxima of posterior density), including a small neural network and a Gaussian process with a Berkson-style error. I will focus on models that don’t have any obvious error (like an omitted prior), although such errors can lead to non-identifiability. A recurring theme in this post is that identifiability may depend on the actual data as well as the model; keep that in mind when modelling!

A frequent source of fitting issues due to non-identifiability are mixture models. There is an excellent case study for mixture models by Michael Betancourt and I have nothing to add to this topic, so mixtures are not covered here.

The post got pretty long so let’s not hesitate and get our hands dirty!

```
library(tidyverse)
library(rstan)
library(bayesplot)
library(tidybayes)
library(knitr)
library(here)
library(rgl)
knit_hooks$set(webgl = hook_webgl)
theme_set(cowplot::theme_cowplot())
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
```

First, let’s start with a model I thought would have problems but that ended up mostly OK. The model is a simple regression with a quadratic term:

\[ \begin{align} \mu_i &= \beta_1 x_i + \beta_2 x_i^2 \\ y_i &\sim N(\mu_i, \sigma) \end{align} \] and here is the Stan code:

```
stan_code_linear <- "
data {
int N;
vector[N] y;
vector[N] x;
real<lower=0> sigma;
real<lower=0> prior_width;
}
parameters {
real beta[2];
}
model {
y ~ normal(beta[1] * x + beta[2] * square(x), sigma);
sigma ~ normal(0,1);
beta ~ normal(0,prior_width);
}
"
model_linear <- stan_model(model_code = stan_code_linear)
```

When a wide array of \(x\) values is available, this model works without any trouble. But what happens when all \(x_i \in \{0,1\}\)? In this case the likelihood cannot distinguish between the contributions of \(\beta_1\) and \(\beta_2\): since \(x^2 = x\) whenever \(x \in \{0,1\}\), only the sum \(\beta_1 + \beta_2\) enters the likelihood. Let’s simulate some data and have a look at the pairs plot:

```
set.seed(20180512)
sigma = 1
x = rep(c(0,1), times = 10)
data_linear <- list(
N = length(x),
x = x,
y = rnorm(length(x), x + x ^ 2, sigma),
sigma = sigma,
prior_width = 10
)
fit_linear <- sampling(model_linear, data = data_linear)
pairs(fit_linear, pars = "beta")
```

While the result is not very useful, the sampler worked well! We gained little information about each \(\beta\) individually (their range spans almost all of the prior), but their sum is tightly constrained, as witnessed by the strong negative correlation. So what if we increase `prior_width` to make the prior resemble a flat prior? We do get max treedepth warnings!

```
set.seed(21645465)
data_linear2 <- data_linear
data_linear2$prior_width = 100
fit_linear2 <- sampling(model_linear, data = data_linear2)
```

```
## Warning: There were 263 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

`pairs(fit_linear2, pars = "beta")`

What seems to happen is that the ridge in the posterior becomes too long and the sampler cannot traverse it efficiently, resulting in transitions that exceed the maximum treedepth. The root of the problem is that the sampler has to choose a step size that is shorter than the width of the ridge to not diverge when moving tangentially to the ridge direction. With such a small step size, the sampler cannot move across the length of the ridge in one iteration. Still, the results should be unbiased and if we manage to get a reasonable `n_eff` (which we do), there is no reason to worry. An improper flat prior would however lead to some actual trouble. Don’t use improper priors, folks!

**Take home message**: While this particular model works well unless we set the prior too wide, linear correlations in the pairs plot are a bad sign and you should try to avoid them as they can produce problems when interacting with other components of a larger model. It might make sense to reparametrize using the sum or ratio of the variables in question.
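One such reparametrization (my own sketch, not from the model above): sample the well-constrained sum alongside the poorly constrained difference, and recover the original coefficients deterministically:

\[ s = \beta_1 + \beta_2, \quad d = \beta_1 - \beta_2, \qquad \beta_1 = \frac{s + d}{2}, \quad \beta_2 = \frac{s - d}{2} \]

With priors placed on \(s\) and \(d\) directly, the strong posterior correlation between the sampled parameters largely disappears for data like ours.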

Let’s move to a model where the non-identifiability actually wreaks havoc. Once again, the model works for some data, but breaks for others - even if all model assumptions do hold. The model tries to determine parameters of a sigmoid function that is observed noisily:

\[ \begin{align} y_{true} &= \frac{1}{1 + e^{-wx-b}} \\ y &\sim N(y_{true},\sigma) \end{align} \]

Here \(w\) and \(b\) are the only parameters of the model. The model is a bit artificial, but it is actually a component of larger gene expression models I work with. The corresponding Stan code is below:

```
stan_code_sigmoid <- "
data {
int N;
vector[N] y;
vector[N] x;
real<lower=0> prior_width;
real<lower=0> sigma;
}
parameters {
real w;
real b;
}
model {
vector[N] y_true = inv_logit(w * x + b);
y ~ normal(y_true, sigma);
w ~ normal(0,prior_width);
b ~ normal(0,prior_width);
}
"
model_sigmoid <- stan_model(model_code = stan_code_sigmoid)
```

Now let’s fit the model to simulated datasets with the exact same true parameter values \(w = b = 1\), but different values of the independent variable \(x\). In the first case, \(x\) is drawn from \(N(0,2)\):

```
set.seed(214575878)
simulate_sigmoid <- function(x) {
sigma = 0.1
w = 1
b = 1
N = length(x)
y_true = 1 / (1 + exp(-w*x-b))
prior_width = 10
list(
N = N,
x = x,
y = rnorm(N, y_true, sigma),
prior_width = prior_width,
sigma = sigma
)
}
data_sigmoid_ok <- simulate_sigmoid(rnorm(20, 0, 2))
fit_sigmoid_ok <- sampling(model_sigmoid, data = data_sigmoid_ok)
print(fit_sigmoid_ok)
```

```
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## w 1.02 0.00 0.14 0.78 0.93 1.01 1.11 1.31 1295 1
## b 1.12 0.00 0.19 0.79 0.99 1.10 1.23 1.51 1404 1
## lp__ -14.66 0.03 1.06 -17.48 -15.07 -14.33 -13.91 -13.66 1318 1
##
## Samples were drawn using NUTS(diag_e) at Sat May 14 13:26:16 2022.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
```

`pairs(fit_sigmoid_ok, pars = c("w","b"))`

For the first dataset, the model converges and recovers parameters correctly. The pairs plot shows a nice gaussian blob, nothing to worry about. Now let’s try a second dataset, this time \(x\) is drawn from \(N(5,2)\).

```
data_sigmoid_divergent <- simulate_sigmoid(rnorm(20, 5, 2))
fit_sigmoid_divergent <- sampling(model_sigmoid, data = data_sigmoid_divergent)
```

```
## Warning: There were 281 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning: The largest R-hat is 1.06, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
```

```
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
```

```
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
```

`print(fit_sigmoid_divergent)`

```
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## w 3.91 0.59 5.55 0.30 0.55 0.93 5.62 19.44 89 1.04
## b 1.82 0.17 1.81 0.36 0.90 1.29 2.14 6.04 108 1.03
## lp__ -10.63 0.29 2.48 -16.30 -12.56 -9.79 -8.44 -7.74 73 1.06
##
## Samples were drawn using NUTS(diag_e) at Sat May 14 13:26:52 2022.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
```

For the second dataset, there is a huge number of divergences and the parameters are largely uncertain and overestimated. So once again, the model is identifiable in principle, but the extreme data introduce problems.

Let’s try to visualise the posteriors - luckily we only have two parameters, so it is easy to see everything at once!

`pairs(fit_sigmoid_divergent, pars = c("w","b"))`

The issue seems to be that for the second dataset, we sampled \(x\) towards the tail of the sigmoid and almost all \(y\) are thus close to 1, giving us little information about the parameters. However, the model strictly enforces that \(w x + b > 0\). This creates a large area of the posterior where the value of \(w\) and \(b\) does not matter much (requiring a large step size to traverse) and a thin sharp boundary around \(wx + b \simeq 0\) where a smaller step size is required to traverse safely. Transitions crossing the boundary often diverge due to large step size and are rejected, leading to overexploration of the flat area and bias.

This is, in my experience, one of the ways that weak identifiability may hurt sampling - the model may be weakly identified only for a subset of the parameter space, while there is another area where the likelihood has a huge contribution, and these may require different step sizes for the sampler.

The issue may be partially remedied by reparametrizing in terms of \(a = \mathrm{E}(wx + b)\) and \(s = \mathrm{sd}(wx + b)\). You can then set priors on \(a\) and \(s\) that avoid the tails of the sigmoid while being independent of \(x\).
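Spelling out that reparametrization (my own sketch, writing \(s = \mathrm{sd}(wx + b)\) to avoid clashing with the intercept \(b\), and assuming \(w > 0\)): with \(\bar{x}\) and \(\mathrm{sd}(x)\) the sample mean and standard deviation of \(x\),

\[ a = w \bar{x} + b, \quad s = w \, \mathrm{sd}(x) \quad \Longrightarrow \quad w = \frac{s}{\mathrm{sd}(x)}, \quad b = a - w \bar{x} \]

so priors on \(a\) and \(s\) translate deterministically into values of \(w\) and \(b\) inside the model.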

**Take home message**: Sharp boundaries of otherwise diffuse regions in the posterior (as seen above) are worth investigating.

We can make the above model even more problematic by introducing a parameter \(k\) to generalize the sigmoid a bit more:

\[ \begin{align} y_{true} &= \frac{k}{1 + e^{-wx-b}} \\ y &\sim N(y_{true},\sigma) \end{align} \]

giving us the following Stan code:

```
stan_code_sigmoid2 <- "
data {
int N;
vector[N] y;
vector[N] x;
real<lower=0> prior_width;
real<lower=0> sigma;
}
parameters {
real k;
real w;
real b;
}
model {
vector[N] y_true = k * inv_logit(w * x + b);
y ~ normal(y_true, sigma);
w ~ normal(0,prior_width);
b ~ normal(0,prior_width);
}
"
model_sigmoid2 <- stan_model(model_code = stan_code_sigmoid2)
```

This time we will simulate two datasets with the same \(x \sim N(0,2)\), avoiding the extreme values. We will set \(k = b = 1\) for both datasets. In addition, the first dataset will have \(w = 1\). Let’s see how that goes:

```
set.seed(98321456)
simulate_sigmoid2 <- function(x, w) {
sigma = 0.1
k = 1
b = 1
N = length(x)
prior_width = 10
y_true = k / (1 + exp(-w*x-b))
list(
N = N,
x = x,
y = rnorm(N, y_true, sigma),
prior_width = prior_width,
sigma = sigma
)
}
data_sigmoid2_ok <- simulate_sigmoid2(rnorm(20, 0, 2), w = 1)
fit_sigmoid2_ok <- sampling(model_sigmoid2, data = data_sigmoid2_ok)
print(fit_sigmoid2_ok)
```

```
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## k 1.05 0.00 0.09 0.90 0.99 1.04 1.11 1.23 1093 1.01
## w 1.08 0.01 0.20 0.75 0.94 1.05 1.20 1.53 1062 1.01
## b 1.06 0.01 0.40 0.38 0.76 1.02 1.31 1.94 1013 1.01
## lp__ -9.06 0.04 1.22 -12.17 -9.65 -8.75 -8.15 -7.63 1210 1.00
##
## Samples were drawn using NUTS(diag_e) at Sat May 14 13:31:14 2022.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
```

Nice, no problems here. Let’s see what happens when we set \(w = 0\).

```
data_sigmoid2_divergent <- simulate_sigmoid2(rnorm(20, 0, 2), w = 0)
fit_sigmoid2_divergent <- sampling(model_sigmoid2, data = data_sigmoid2_divergent)
```

```
## Warning: There were 21 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
```

```
## Warning: There were 391 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning: The largest R-hat is 1.57, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
```

```
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
```

```
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
```

`print(fit_sigmoid2_divergent)`

```
## Inference for Stan model: anon_model.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75%
## k 2471033772.87 2.960192e+09 5.236172e+09 0.68 0.72 0.73 15413114.53
## w -0.28 7.000000e-02 1.570000e+00 -4.07 -0.98 0.02 0.41
## b 4.51 1.134000e+01 1.673000e+01 -23.99 -2.97 10.64 15.76
## lp__ -12.38 2.700000e-01 1.240000e+00 -15.46 -13.00 -12.23 -11.43
## 97.5% n_eff Rhat
## k 1.875328e+10 3 2.22
## w 2.770000e+00 471 1.01
## b 2.624000e+01 2 3.46
## lp__ -1.068000e+01 22 1.09
##
## Samples were drawn using NUTS(diag_e) at Sat May 14 13:32:13 2022.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
```

As has become the habit in this post, the model diverges for the second dataset. Note also that the parameter estimates for \(k\) and \(b\) are way off and their 95% credible intervals exclude the true value! (remember: even a few divergent transitions indicate sampling problems). Let’s have a look at the pair plots for both models:

`pairs(fit_sigmoid2_ok, pars = "energy__", include = FALSE)`

`pairs(fit_sigmoid2_divergent, pars = "energy__", include = FALSE)`

Clearly something substantial changed for the second model. But while there is a lot of stuff that looks fishy, it is hard to understand what exactly is going on. The clue is in looking at the interaction of \(w\), \(k\) and the log posterior (`lp__`) - you can play with the 3D plot below with your mouse. Note that ShinyStan provides similar 3D plots under `Explore -> Trivariate`.

```
samples <- rstan::extract(fit_sigmoid2_divergent)
open3d() %>% invisible()
rgl::plot3d(samples$w, samples$k, samples$lp__, xlab = "w", ylab = "k", zlab = "lp__")
```

The thing to notice here is that for \(w = 0\), there is a thin ridge in the posterior with almost maximal `lp__` for a wide range of values of \(k\). This is because for \(w = 0\), the posterior ceases to depend on \(x\) and \(k\) becomes very closely tied to \(b\) - simultaneously, almost any value of \(k\) that is admitted by the prior becomes feasible with a suitable value of \(b\). The ridge is so thin that the sampler almost completely misses it and it is barely visible in the plot. But for the true posterior, this ridge should contribute a non-trivial amount of mass. Once again, the sampler adapts its step size to the wide distribution for \(w \neq 0\), which leads to divergences and rejections when traversing the narrow ridge at \(w = 0\), bringing the problem to our attention.

What can you do about these kinds of problems? When you fit your model interactively, the best solution is to spot that your actual data are a special case and simplify your model. Special cases however become more of a worry when you need to automatically fit the model to a large number of datasets or refit it periodically as new data become available. The best solution I have found so far is to try fitting a simplified model first (here something like \(y \sim N(a,\sigma)\)). If the simplified model fits well and/or the full model diverges, the results of the simplified model are preferred. If you know of a better solution, please let me know - it is directly relevant to my work on models of gene regulation! The link also contains more discussion of the reparametrizations I use to make this kind of model converge.
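The fallback logic itself is simple; here is a minimal sketch (the helper name and its arguments are my own - in practice the divergence count would come from something like `rstan::get_num_divergent`):

```r
# Prefer the full model only when it sampled cleanly; otherwise fall
# back to the simplified model. `choose_fit` is a hypothetical helper,
# not part of rstan.
choose_fit <- function(fit_simple, fit_full, n_divergent_full) {
  if (n_divergent_full > 0) {
    message("Full model had ", n_divergent_full,
            " divergent transitions; using the simplified model.")
    fit_simple
  } else {
    fit_full
  }
}
```

In an automated pipeline you would also want to log which branch was taken, so that silent fallbacks don’t hide a systematic problem with the full model.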

**Take home message**: Worry about degenerate special cases of your model. A 3D trivariate plot of two parameters vs. the log posterior makes it neatly visible when your posterior is not unimodal.

At this point it would make sense to mention mixture models, but as those are covered by the aforementioned case study, we’ll go straight ahead to the desperate wilderness of models too broken to fix. And neural networks are a prime attraction in this godforsaken land.

We don’t need to go fancy. Let’s have a feedforward neural net with two inputs, two hidden layers of two neurons each and a single output neuron.

We will use the standard logistic sigmoid activation function and treat the problem as binary classification. To make things simpler, and because we saw that the sigmoid may be non-identifiable by itself, we ignore all the bias parameters, so the only parameters are the weights \(w\) of the inputs \(x\) and the activation function becomes:

\[ \frac{1}{1+e^{-\sum w_i x_i}} \]
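To make the architecture concrete, the forward pass of this tiny network can be written directly in R (my own sketch, mirroring the bias-free structure described above):

```r
inv_logit <- function(x) 1 / (1 + exp(-x))

# Forward pass of the tiny bias-free network: two inputs, two hidden
# layers of two neurons each, one output neuron.
# x: N x 2 matrix; weights1, weights2: 2 x 2 matrices;
# weights_out: length-2 vector. Returns a vector of N probabilities.
neural_forward <- function(x, weights1, weights2, weights_out) {
  output1 <- inv_logit(x %*% weights1)
  output2 <- inv_logit(output1 %*% weights2)
  as.vector(inv_logit(output2 %*% weights_out))
}
```

The symmetries are already visible here: swapping the two hidden neurons of a layer (together with the corresponding rows/columns of the adjacent weight matrices) leaves the output unchanged.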

Below is the corresponding Stan model - optimized for readability, not brevity or generalizability. Since it seems there might be some symmetries, and we learned our lesson from mixture models, we’ll try at least to force the weights for the output neuron to be ordered.

```
stan_code_neural <- "
data {
int N;
matrix[N,2] x;
int<lower=0, upper=1> y[N];
real prior_width;
}
parameters {
matrix[2,2] weights1;
matrix[2,2] weights2;
ordered[2] weights_out;
}
model {
matrix[N,2] input1 = x * weights1;
matrix[N,2] output1 = inv_logit(input1);
matrix[N,2] input2 = output1 * weights2;
matrix[N,2] output2 = inv_logit(input2);
vector[N] input_out = output2 * weights_out;
vector[N] output_out = inv_logit(input_out);
y ~ bernoulli(output_out);
to_vector(weights1) ~ normal(0, prior_width);
to_vector(weights2) ~ normal(0, prior_width);
weights_out ~ normal(0, prior_width);
}
"
model_neural <- stan_model(model_code = stan_code_neural)
```

In the spirit of the best traditions of the field of machine learning, we’ll try to teach XOR to the neural network. It does not go well. To make the pathologies better visible, we will use 8 chains instead of the usual 4.

```
set.seed(1324578)
N <- 200
x <- array(as.integer(rbernoulli(N * 2)), c(N, 2))
y <- as.integer(xor(x[, 1], x[, 2]))
data_neural <- list(N = N, x = x, y = y, prior_width = 5)
fit_neural <- sampling(model_neural, data = data_neural, chains = 8)
```

```
## Warning: There were 178 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning: The largest R-hat is 2.34, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
```

```
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
```

```
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
```

```
show_param_diags <- function(fit) {
  summary(fit)$summary[, c("n_eff", "Rhat")]
}
show_param_diags(fit_neural)
```

```
## n_eff Rhat
## weights1[1,1] 4.586133 2.842062
## weights1[1,2] 4.806146 2.472001
## weights1[2,1] 4.904083 2.345798
## weights1[2,2] 4.705657 2.597202
## weights2[1,1] 4.503922 3.078342
## weights2[1,2] 4.815908 2.502825
## weights2[2,1] 4.626493 2.776578
## weights2[2,2] 5.432725 1.999367
## weights_out[1] 4.433889 3.237023
## weights_out[2] 4.896132 2.350218
## lp__ 4.061244 8.279744
```

We note the divergences, very low `n_eff` and large `Rhat`. Remember that `n_eff` (effective sample size) is a measure of how well the posterior was sampled, and `Rhat` is close to one if the chains have converged to the same posterior. This time let’s start by inspecting some traceplots - I chose the ones I consider most interesting, but in practice you would want to look at all of them (once again, those are also available in ShinyStan):
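To see what `Rhat` reacts to, here is a minimal Python sketch of the classic (non-split) potential scale reduction statistic - `rstan` actually uses a more refined split-\(\hat{R}\), but the intuition carries over: it compares between-chain and within-chain variance, so chains stuck in different modes inflate it far above 1.

```python
import numpy as np

def rhat(chains):
    # classic potential scale reduction for one parameter;
    # chains has shape (n_chains, n_iterations)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * W + B / n
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))            # all chains explore the same posterior
stuck = mixed + np.arange(4)[:, None] * 3.0   # each chain stuck in its own mode
print(rhat(mixed))  # close to 1
print(rhat(stuck))  # much larger than 1
```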

```
fit_for_bayesplot <- as.array(fit_neural)
mcmc_trace(fit_for_bayesplot, pars = c("weights2[1,2]", "weights1[2,2]"),
           facet_args = list(ncol = 1)) + scale_color_discrete()
```

We clearly see that there are multiple modes and each chain is stuck in its mode and does not mix with others. The first trace plot shows that just investigating the marginal posterior for `weights2[1,2]` reveals 3 well-separated modes. Looking at the traceplot for `weights1[2,2]` we see that there have to be even more modes, as here chain 2 (green-brown-ish?) clusters with chain 5 (blue), while in the first traceplot it clusters with chain 8 (pink).

Looking at the pairs plot provides some additional hints:

`pairs(fit_neural, pars = c("weights2[1,2]","lp__"))`

We see that at least the two “denser” modes are symmetric across zero and that they reach about the same maximum `lp__` as the “diffuse” mode. This means that to identify the model we would have to somehow choose one of those modes, but there is clearly no “best” (much higher `lp__`) mode.

My best guess is that the divergences arise when the sampler (unsuccessfully) tries to switch between individual modes and the geometry gets narrower in some parameters and wider in others, but this is just guesswork.

There is not really much that can be done to make such models work. The most obvious issues come from symmetries of the network structure: relabelling the neurons produces multiple modes even though the relabelled networks are isomorphic. To some extent we could get rid of them by ordering one of the weights in each layer. However, ordering is just the start and further issues just keep on coming - see for example the forum thread on Bayesian neural networks for more details.
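One of those symmetries is easy to check numerically. A small Python sketch (independent of the Stan code above) shows that relabelling the two neurons of the first hidden layer - permuting the columns of `weights1` together with the rows of `weights2` - leaves the network output, and hence the likelihood, exactly unchanged:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, w1, w2, w_out):
    # bias-free feedforward net matching the Stan model above
    return sigmoid(sigmoid(sigmoid(x @ w1) @ w2) @ w_out)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
w1, w2 = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
w_out = rng.normal(size=2)

perm = [1, 0]  # relabel the two neurons of the first hidden layer
out_orig = forward(x, w1, w2, w_out)
out_perm = forward(x, w1[:, perm], w2[perm, :], w_out)
print(np.allclose(out_orig, out_perm))  # True: the two parameter vectors are equivalent
```

The same argument applies to the second hidden layer, except that there the `ordered` constraint on `weights_out` breaks the tie.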

**Take home message**: Non-overlapping traces without treedepth warnings indicate multimodality. If the modes have about the same `lp__`, some symmetry-breaking constraints may help. If there is one mode with much larger `lp__` than the others, it might make sense to favor it by appropriate priors/reparametrization.

In the above example, there were multiple discrete and well separated modes. But there is still a way to move non-identifiability to the next level. We’ll start with a simple and harmless Gaussian process model with squared exponential covariance function, the model is:

\[ \begin{align} y_{est} &\sim GP(\rho,\tau) \\ y &\sim N(y_{est}(x), \sigma) \end{align} \]

Here, \(x \in (0,1)\) are the locations where the GP is observed, and \(\rho\) and \(\tau\) are the fixed length-scale and marginal standard deviation of the squared exponential covariance (passed in as data). The corresponding Stan code is:

```
stan_code_gp <- "
data {
  int N;
  real x[N];
  vector[N] y;
  real<lower=0> gp_length;
  real<lower=0> gp_scale;
  real<lower=0> sigma;
}
transformed data {
  vector[N] mu = rep_vector(0, N);
}
parameters {
  vector[N] y_est_raw;
}
transformed parameters {
  vector[N] y_est;
  // Using the latent variable GP coding from the Stan manual,
  // with the Cholesky decomposition
  {
    matrix[N, N] L_K;
    matrix[N, N] K = cov_exp_quad(x, gp_scale, gp_length);
    for (n in 1:N) {
      K[n, n] = K[n, n] + 1e-12; // Ensure positive definite
    }
    L_K = cholesky_decompose(K);
    y_est = L_K * y_est_raw;
  }
}
model {
  y ~ normal(y_est, sigma);
  y_est_raw ~ normal(0, 1);
}
"
model_gp <- stan_model(model_code = stan_code_gp)
```

We simulate some data and check that the model works well:

```
set.seed(25748422)
simulate_gp <- function(x) {
  N <- length(x)
  gp_length <- 0.3
  gp_scale <- 1
  sigma <- 0.1
  cov_m <- matrix(0, nrow = N, ncol = N)
  for (i in 1:N) {
    for (j in i:N) {
      cov_m[i, j] <- gp_scale ^ 2 * exp(-0.5 * (1 / gp_length ^ 2) * (x[i] - x[j]) ^ 2)
      cov_m[j, i] <- cov_m[i, j]
    }
  }
  # chol() returns the upper triangular factor U with t(U) %*% U == cov_m,
  # so we need the transpose to get samples with the right covariance
  chol_cov_m <- t(chol(cov_m))
  y <- chol_cov_m %*% rnorm(N, 0, 1)
  list(N = N, x = x, y = array(y, N), gp_length = gp_length, gp_scale = gp_scale, sigma = sigma)
}
data_gp <- simulate_gp(x = seq(0.01, 0.99, length.out = 10))
fit_gp <- sampling(model_gp, data = data_gp)
show_param_diags(fit_gp) %>% head()
```

```
## n_eff Rhat
## y_est_raw[1] 3159.529 1.000177
## y_est_raw[2] 2315.263 1.000567
## y_est_raw[3] 1975.222 1.000598
## y_est_raw[4] 1860.736 1.001232
## y_est_raw[5] 2289.527 1.001320
## y_est_raw[6] 2555.396 1.001130
```

We note the good diagnostics and also look at the posterior draws versus the observed values.

```
samples_to_show <- sample(1:4000, 50)
fit_gp %>%
  tidybayes::spread_draws(y_est[x_index]) %>%
  inner_join(data.frame(x_index = 1:data_gp$N, x = data_gp$x), by = c("x_index" = "x_index")) %>%
  mutate(sample_id = (.chain - 1) * 1000 + .iteration) %>%
  filter(sample_id %in% samples_to_show) %>%
  ggplot(aes(x = x, y = y_est, group = sample_id, color = as.factor(.chain))) +
  geom_line(alpha = 0.3) +
  geom_point(data = data.frame(x = data_gp$x, y = data_gp$y), aes(x = x, y = y), inherit.aes = FALSE)
```

And now a magic trick that turns this well-behaved model into a mess: we’ll treat \(x\), the locations where the GP is observed, as unknown. Since \(x \in (0,1)\), we can specify a Beta prior for the locations with varying precision. The modified Stan code follows.

```
stan_code_gp_mess <- "
data {
  int N;
  vector[N] y;
  real<lower=0> gp_length;
  real<lower=0> gp_scale;
  real<lower=0> sigma;
  vector<lower=0, upper=1>[N] x_prior_mean;
  real<lower=0> x_prior_tau;
}
transformed data {
  vector[N] mu = rep_vector(0, N);
}
parameters {
  real<lower=0, upper=1> x[N];
  vector[N] y_est_raw;
}
transformed parameters {
  vector[N] y_est;
  // Using the latent variable GP coding from the Stan manual,
  // with the Cholesky decomposition
  {
    matrix[N, N] L_K;
    matrix[N, N] K = cov_exp_quad(x, gp_scale, gp_length);
    for (n in 1:N) {
      K[n, n] = K[n, n] + 1e-12; // Ensure positive definite
    }
    L_K = cholesky_decompose(K);
    y_est = L_K * y_est_raw;
  }
}
model {
  y ~ normal(y_est, sigma);
  y_est_raw ~ normal(0, 1);
  x ~ beta(x_prior_mean * x_prior_tau, (1 - x_prior_mean) * x_prior_tau);
}
"
model_gp_mess <- stan_model(model_code = stan_code_gp_mess)
```

Let’s start with a noninformative uniform prior on \(x\).

```
set.seed(42148744)
data_gp_mess_uniform <- data_gp
#This puts uniform prior on all x
data_gp_mess_uniform$x_prior_mean = rep(0.5, data_gp$N)
data_gp_mess_uniform$x_prior_tau = 2
fit_gp_mess <- sampling(model_gp_mess, data = data_gp_mess_uniform)
```

```
## Warning: There were 565 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
```

```
## Warning: There were 2185 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## https://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

```
## Warning: The largest R-hat is 2.51, indicating chains have not mixed.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#r-hat
```

```
## Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#bulk-ess
```

```
## Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
## Running the chains for more iterations may help. See
## https://mc-stan.org/misc/warnings.html#tail-ess
```

`show_param_diags(fit_gp_mess) %>% head()`

```
## n_eff Rhat
## x[1] 2.111522 4.678524
## x[2] 2.081677 5.845164
## x[3] 2.064901 6.388092
## x[4] 2.100400 5.438441
## x[5] 2.046444 6.949529
## x[6] 11.013634 1.391479
```

We can note that we got *both* divergence and max treedepth warnings, meaning that the step size is sometimes too large and sometimes too small; also, both `n_eff` and `Rhat` are atrocious. Let’s inspect some traces:

```
fit_for_bayesplot <- as.array(fit_gp_mess)
mcmc_trace(fit_for_bayesplot, pars = c("x[1]","y_est[1]", "lp__"), facet_args = list(ncol = 1)) + scale_color_discrete()
```

```
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
```

In contrast to the neural net example, here the chains do not stick to a single mode; instead, they slowly wander across a wide range of values. Further, we see that the log posterior more or less overlaps across the explored parts of the parameter space. What does that look like when we plot the posterior?

```
plot_gp_mess <- function(fit_gp_mess) {
  samples_to_show <- sample(1:4000, 50)
  fit_gp_mess %>%
    tidybayes::spread_draws(y_est[x_index], x[x_index]) %>%
    mutate(sample_id = (.chain - 1) * 1000 + .iteration) %>%
    filter(sample_id %in% samples_to_show) %>%
    ggplot(aes(x = x, y = y_est, group = sample_id, color = as.factor(.chain))) +
    geom_line(alpha = 0.3) +
    geom_point(data = data.frame(x = data_gp$x, y = data_gp$y), aes(x = x, y = y), inherit.aes = FALSE)
}
plot_gp_mess(fit_gp_mess)
```

Not even close! We also see that the chains actually do sample different regions of the whole posterior, despite overlapping in the marginal posteriors. Let’s also take a look at some pairs plots:

`pairs(fit_gp_mess, pars = c("x[1]","y_est[1]", "lp__"))`

Those pairs plots are not of much help, except that they once again show that there is no clean separation in log posterior (`lp__`) between the individual modes. Taken together, this indicates that the posterior likely has many modes connected by thin, possibly curved ridges with only slightly smaller `lp__`. This is not surprising, since the ordering of \(x\) is only weakly constrained by the GP variance (when values of \(x\) are close together, the corresponding \(y\) should be as well). In fact, we would expect the number of modes to be of order \(N!\) (factorial of \(N\)).
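The mechanism behind this can be verified directly: the GP prior density is invariant to permuting the locations together with the latent values, so every relabelling of the \((x_i, y_{est,i})\) pairs is exactly as probable a priori. A quick numerical check (a Python sketch, independent of the Stan code):

```python
import numpy as np

def cov_exp_quad(x, scale=1.0, length=0.3):
    # squared exponential covariance, as in the Stan model
    d = x[:, None] - x[None, :]
    return scale**2 * np.exp(-0.5 * d**2 / length**2)

def gp_logpdf(y_est, x, jitter=1e-8):
    # log density of y_est under a zero-mean GP observed at locations x
    K = cov_exp_quad(x) + jitter * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    quad = y_est @ np.linalg.solve(K, y_est)
    return -0.5 * (logdet + quad + len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=6)
y_est = rng.normal(size=6)

perm = rng.permutation(6)
print(np.isclose(gp_logpdf(y_est, x), gp_logpdf(y_est[perm], x[perm])))  # True
```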

A nice diagnostic trick is to set informative priors on \(x\), centered on the true values. The width of the prior required to make the model identified tells us something about the severity of the issues. This is where the Beta prior, in particular its parametrization via mean and precision (\(\tau\)), comes in handy. So, let’s see if \(\tau = 1000\), i.e. the amount of information contained in 1000 coin flips, is enough. Note that we also have to initialize the chains around the true values, otherwise the sharp prior introduces sampling problems.
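How narrow is that? For a Beta distribution parametrized by mean \(\mu\) and precision \(\tau\), i.e. \(\mathrm{Beta}(\mu\tau, (1-\mu)\tau)\), the variance is \(\mu(1-\mu)/(\tau+1)\). A quick side computation (in Python, separate from the R analysis):

```python
import math

def beta_sd(mean, tau):
    # standard deviation of Beta(mean * tau, (1 - mean) * tau)
    return math.sqrt(mean * (1 - mean) / (tau + 1))

print(beta_sd(0.5, 1000))  # around mean 0.5, tau = 1000 pins x to within ~0.016
print(beta_sd(0.5, 500))   # tau = 500 is only ~sqrt(2) wider
```

So this prior concentrates each location in an interval of a few hundredths of the unit interval.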

```
set.seed(741284)
data_gp_mess_informed <- data_gp
data_gp_mess_informed$x_prior_mean <- data_gp$x
data_gp_mess_informed$x_prior_tau <- 1000
informed_init <- function() {
  list(x = data_gp$x)
}
fit_gp_mess_informed <- sampling(model_gp_mess, data = data_gp_mess_informed, init = informed_init)
show_param_diags(fit_gp_mess_informed) %>% head()
```

```
## n_eff Rhat
## x[1] 7807.223 0.9996608
## x[2] 6203.740 1.0000260
## x[3] 7690.874 0.9993963
## x[4] 6966.835 0.9993572
## x[5] 7530.549 0.9993165
## x[6] 7651.445 0.9992267
```

`plot_gp_mess(fit_gp_mess_informed)`

```
## Warning: 'tidybayes::spread_samples' is deprecated.
## Use 'spread_draws' instead.
## See help("Deprecated") and help("tidybayes-deprecated").
```

And it indeed is enough: the diagnostics look good, the posterior looks good, everything’s peachy. But we needed a very narrow prior, and we could not get away with much less information. Consider setting \(\tau = 500\) (still a very narrow prior):

```
set.seed(32148422)
data_gp_mess_less_informed <- data_gp
data_gp_mess_less_informed$x_prior_mean = data_gp$x
data_gp_mess_less_informed$x_prior_tau = 500
fit_gp_mess_less_informed <- sampling(model_gp_mess, data = data_gp_mess_less_informed, init = informed_init)
```

```
## Warning: There were 3 divergent transitions after warmup. See
## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## to find out why this is a problem and how to eliminate them.
```

`## Warning: Examine the pairs() plot to diagnose sampling problems`

`show_param_diags(fit_gp_mess_less_informed) %>% head()`

```
## n_eff Rhat
## x[1] 6299.314 0.9995523
## x[2] 6007.538 0.9992767
## x[3] 6495.730 0.9996561
## x[4] 6313.321 1.0002979
## x[5] 6340.341 1.0005830
## x[6] 6331.998 0.9991734
```

`plot_gp_mess(fit_gp_mess_less_informed)`

```
## Warning: 'tidybayes::spread_samples' is deprecated.
## Use 'spread_draws' instead.
## See help("Deprecated") and help("tidybayes-deprecated").
```

Even though the posterior looks more or less OK, we see that the chains have not mixed well (notably, chain 2 forms a slightly separate cluster), as also indicated by some of the diagnostics. So we can conclude that the model is screwed, as it is not identified unless we already know the values of \(x\) quite precisely.

**Take home message**: Chains wandering slowly across large areas of the posterior but with roughly the same `lp__` are a very bad sign. Putting narrow priors centered on true parameters is a neat trick to understand your model better.

That’s it - if you’ve made it to the bottom of this looooong post, you are great and thanks very much! I really hope that it will help you interpret your own models and help determine how to fix them or when to abandon a hopeless situation and get back to the drawing board. I also hope I have convinced you that identifiability depends not only on the model but also on the actual observed dataset.

Best of luck with modelling!

`sessionInfo()`

```
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Czech_Czechia.1250 LC_CTYPE=Czech_Czechia.1250
## [3] LC_MONETARY=Czech_Czechia.1250 LC_NUMERIC=C
## [5] LC_TIME=Czech_Czechia.1250
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rgl_0.108.3 here_1.0.1 knitr_1.38 tidybayes_3.0.1
## [5] bayesplot_1.8.1 rstan_2.26.6 StanHeaders_2.26.6 forcats_0.5.1
## [9] stringr_1.4.0 dplyr_1.0.7 purrr_0.3.4 readr_2.1.0
## [13] tidyr_1.1.4 tibble_3.1.6 ggplot2_3.3.5 tidyverse_1.3.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-2 ellipsis_0.3.2 ggridges_0.5.3
## [4] rprojroot_2.0.2 fs_1.5.2 rstudioapi_0.13
## [7] farver_2.1.0 svUnit_1.0.6 fansi_0.5.0
## [10] lubridate_1.8.0 xml2_1.3.2 codetools_0.2-18
## [13] extrafont_0.17 cachem_1.0.6 jsonlite_1.7.2
## [16] broom_0.8.0 Rttf2pt1_1.3.9 dbplyr_2.1.1
## [19] ggdist_3.0.0 compiler_4.1.0 httr_1.4.2
## [22] backports_1.3.0 assertthat_0.2.1 fastmap_1.1.0
## [25] cli_3.2.0 htmltools_0.5.2 prettyunits_1.1.1
## [28] tools_4.1.0 coda_0.19-4 gtable_0.3.0
## [31] glue_1.6.2 reshape2_1.4.4 posterior_1.1.0.9000
## [34] V8_3.6.0 Rcpp_1.0.7 cellranger_1.1.0
## [37] jquerylib_0.1.4 pkgdown_2.0.1 vctrs_0.3.8
## [40] blogdown_1.6 extrafontdb_1.0 tensorA_0.36.2
## [43] xfun_0.30 ps_1.6.0 rvest_1.0.2
## [46] lifecycle_1.0.1 scales_1.1.1 hms_1.1.1
## [49] parallel_4.1.0 inline_0.3.19 yaml_2.2.1
## [52] curl_4.3.2 memoise_2.0.0 gridExtra_2.3
## [55] loo_2.4.1 sass_0.4.0 stringi_1.7.5
## [58] highr_0.9 checkmate_2.0.0 pkgbuild_1.2.0
## [61] rlang_0.4.12 pkgconfig_2.0.3 matrixStats_0.61.0
## [64] distributional_0.2.2 evaluate_0.15 lattice_0.20-44
## [67] htmlwidgets_1.5.4 labeling_0.4.2 cowplot_1.1.1
## [70] processx_3.5.2 tidyselect_1.1.1 plyr_1.8.6
## [73] magrittr_2.0.1 bookdown_0.24 R6_2.5.1
## [76] generics_0.1.1 DBI_1.1.1 pillar_1.6.4
## [79] haven_2.4.3 withr_2.4.2 abind_1.4-5
## [82] modelr_0.1.8 crayon_1.4.2 arrayhelpers_1.1-0
## [85] KernSmooth_2.23-20 utf8_1.2.2 tzdb_0.2.0
## [88] rmarkdown_2.11 grid_4.1.0 readxl_1.3.1
## [91] callr_3.7.0 reprex_2.0.1 digest_0.6.28
## [94] RcppParallel_5.1.4 stats4_4.1.0 munsell_0.5.0
## [97] bslib_0.3.1
```

Let’s start with a very simple model: just estimating the mean and standard deviation of a normal distribution:

```
library(rstan)
library(knitr)
library(tidyverse)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
set.seed(3145678)
```

```
model_fixed_code <- "
data {
int N;
vector[N] X;
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
X ~ normal(mu, sigma);
//And some priors
mu ~ normal(0, 10);
sigma ~ student_t(3, 0, 1);
}
"
model_fixed <- stan_model(model_code = model_fixed_code)
```

And let’s simulate some data and see that it fits:

```
mu_true = 8
sigma_true = 2
N = 10
X <- rnorm(N, mean = mu_true, sd = sigma_true)
data_fixed <- list(N = N, X = X)
fit_fixed <- sampling(model_fixed, data = data_fixed, iter = 500)
summary(fit_fixed, probs = c(0.1, 0.9))$summary %>% kable()
```

| | mean | se_mean | sd | 10% | 90% | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| mu | 7.855031 | 0.0256139 | 0.5632183 | 7.162485 | 8.548415 | 483.5059 | 1.007501 |
| sigma | 1.774158 | 0.0206974 | 0.4400573 | 1.302616 | 2.350727 | 452.0508 | 1.003409 |
| lp__ | -12.103350 | 0.0555738 | 1.1132479 | -13.664610 | -11.091775 | 401.2768 | 1.004955 |

Let’s say we now want to handle the case where the standard deviation is known. Obviously we could write a new model. But what if the full model has several hundred lines and the only thing we want to change is to let the user specify the known standard deviation? The simplest solution is to just have all parameters/data that are needed in any of the variants lying around and use `if` conditions in the model block to ignore some of them, but that is a bit unsatisfactory (and those unused parameters may in some cases hinder sampling).

For a better solution, we can take advantage of the fact that Stan allows zero-sized arrays/vectors and features the *ternary operator* `?`. The ternary operator has the syntax `(condition) ? (true value) : (false value)` and works like an `if - else` statement, but within an expression. The last piece of the puzzle is that Stan allows the size of data and parameter arrays to depend on arbitrary expressions computed from data. The model that can handle both known and unknown standard deviation follows:

```
model_optional_code <- "
data {
int N;
vector[N] X;
//Just a verbose way to specify boolean variable
int<lower = 0, upper = 1> sigma_known;
//sigma_data is size 0 if sigma_known is FALSE
real<lower=0> sigma_data[sigma_known ? 1 : 0];
}
parameters {
real mu;
//sigma is size 0 if sigma_known is TRUE
real<lower=0> sigma_param[sigma_known ? 0 : 1];
}
transformed parameters {
real<lower=0> sigma;
if (sigma_known) {
sigma = sigma_data[1];
} else {
sigma = sigma_param[1];
}
}
model {
X ~ normal(mu, sigma);
//And some priors
mu ~ normal(0, 10);
if (!sigma_known) {
sigma_param ~ student_t(3, 0, 1);
}
}
"
model_optional <- stan_model(model_code = model_optional_code)
```

We had to add some boilerplate code, but now we don’t have to maintain two separate models. This trick is also sometimes useful if you want to test multiple variants in development: the model compiles only once, so you can test the variants while modifying other parts of your code and reduce time spent waiting for compilation.

Just to make sure the model works and see how to correctly specify the data, let’s fit it assuming the standard deviation is to be estimated:

```
data_optional <- list(
N = N,
X = X,
sigma_known = 0,
sigma_data = numeric(0) #This produces an array of size 0
)
fit_optional <- sampling(model_optional,
data = data_optional,
iter = 500, pars = c("mu","sigma"))
summary(fit_optional, probs = c(0.1, 0.9))$summary %>% kable()
```

| | mean | se_mean | sd | 10% | 90% | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| mu | 7.854036 | 0.0198265 | 0.5440900 | 7.181837 | 8.531780 | 753.0924 | 0.9981102 |
| sigma | 1.730077 | 0.0152808 | 0.3918781 | 1.308565 | 2.270505 | 657.6701 | 0.9989029 |
| lp__ | -11.992770 | 0.0503044 | 0.9811551 | -13.383729 | -11.089657 | 380.4199 | 1.0016842 |

And now let’s run the model and give it the correct standard deviation:

```
data_optional_sigma_known <- list(
  N = N,
  X = X,
  sigma_known = 1,
  sigma_data = array(sigma_true, 1)
  # The array conversion is necessary, otherwise Stan complains about dimensions
)
fit_optional_sigma_known <- sampling(model_optional,
                                     data = data_optional_sigma_known,
                                     iter = 500, pars = c("mu", "sigma"))
summary(fit_optional_sigma_known, probs = c(0.1, 0.9))$summary %>% kable()
```

| | mean | se_mean | sd | 10% | 90% | n_eff | Rhat |
|---|---|---|---|---|---|---|---|
| mu | 7.808058 | 0.0292710 | 0.6273565 | 7.017766 | 8.622762 | 459.3600 | 1.006164 |
| sigma | 2.000000 | 0.0000000 | 0.0000000 | 2.000000 | 2.000000 | 1000.0000 | NaN |
| lp__ | -11.072234 | 0.0321233 | 0.6750295 | -11.917321 | -10.585280 | 441.5753 | 1.002187 |

Obviously this method lets you do all sorts of more complicated things, in particular:

- When the optional parameter is a vector you can have something like `vector[sigma_known ? 0 : n_sigma] sigma;`
- You can have more than two variants to choose from and then use something akin to `real param[variant == 5 ? 0 : 1];`
- If your conditions become more complex you can always put them into a user-defined function (for optional data) or the `transformed data` block (for optional parameters) as in:

```
functions {
  int compute_whatever_size(int X, int Y, int Z) {
    //do stuff
  }
}
data {
  ...
  real whatever[compute_whatever_size(X,Y,Z)];
  real<lower = 0> whatever_sigma[compute_whatever_size(X,Y,Z)];
}
transformed data {
  int carebear_size;
  //do stuff
  carebear_size = magic_result;
}
parameters {
  vector[carebear_size] carebear;
  matrix[carebear_size,carebear_size] spatial_carebear;
}
```

View the compiled submission or check the Genexpi-Stan GitHub repository to see the source code.

Although my time with the Stan language has been enjoyable, there is one thing that is not fun when modelling with Stan. And it is the dreaded warning message:

```
There were X divergent transitions after warmup.
Increasing adapt_delta above 0.8 may help.
```

Now once you have increased `adapt_delta` to no avail, what should you do? Divergences (and max-treedepth and low E-BFMI warnings alike) tell you there is something wrong with your model, but do not tell you exactly what. There are numerous tricks and strategies to diagnose convergence problems, but currently those are scattered across the Stan documentation, Discourse and the old mailing list. Here, I will try to bring all the tricks that helped me at some point to one place for the reference of future desperate modellers.

**This list is outdated, the Guide to Stan warnings
has a better list of debugging hints.**

I don’t want to keep you waiting, so below is a list of all strategies I have ever used to diagnose and/or remedy divergences:

- Check your code. Twice. Divergences are almost as likely a result of a programming error as they are a truly statistical issue. Do all parameters have a prior? Do your array indices and for loops match?
- Create a simulated dataset with known true values of all parameters. It is useful for so many things (including checking for coding errors). If the errors disappear on simulated data, your model may be a bad fit for the actual observed data.
- Check your priors. If the model is sampling heavily in the very tails of your priors or on the boundaries of parameter constraints, this is a bad sign.
- Visualisations: use `mcmc_parcoord` from the `bayesplot` package, ShinyStan and `pairs` from `rstan`. See the documentation for Stan warnings (contains a few hints), the case study on diagnosing a multilevel model and Gabry et al. 2017 - Visualization in Bayesian workflow.
- Make sure your model is *identifiable* - non-identifiability and/or multimodality (multiple local maxima of the posterior distribution) is a problem. See the case study on mixture models and my post on non-identifiable models and how to spot them.
- Run Stan with the `test_grad` option.
- *Reparametrize* your model to make your parameters independent (uncorrelated) and close to N(0,1) (a.k.a. change the actual parameters and compute your parameters of interest in the `transformed parameters` block).
- Try *non-centered parametrization* - this is a special case of reparametrization that is so frequently useful that it deserves its own bullet. See the case study on diagnosing a multilevel model and Betancourt & Girolami 2015.
- Move parameters to the `data` block and set them to their true values (from simulated data). Then return them one by one to the `parameters` block. Which parameter introduces the problems?
- Introduce tight priors centered at true parameter values. How tight do the priors need to be to let the model fit? Useful for identifying multimodality.
- Play a bit more with `adapt_delta`, `stepsize` and `max_treedepth`. Example

In the coming weeks I hope to be able to provide separate posts on some of the bullets above with a worked-out example. In this introductory post I will try to provide you with some geometric intuition behind what divergences are.

**Caveat:** *I am not a statistician and my understanding of Stan, the NUTS sampler and other technicalities is limited, so I might be wrong in some of my assertions. Please correct me, if you find mistakes.*

Make sure to follow Stan Best practices. Especially, **start with a simple model**, make sure it works and add complexity step by step. I really cannot repeat this enough. To be honest, I often don’t follow this advice myself, because just writing the full model down is so much fun. To be more honest, this has always resulted in me being sad and a lot of wasted time.

Also note that directly translating models from JAGS/BUGS often fails as Stan requires different modelling approaches. Stan developers have experienced first hand that some JAGS models produce wrong results and do not converge even in JAGS, but no one noticed before they compared their output to results from Stan.

Following the Stan manual:

A divergence arises when the simulated Hamiltonian trajectory departs from the true trajectory as measured by departure of the Hamiltonian value from its initial value.

What does that actually mean? The Hamiltonian is a function of the posterior density and of auxiliary momentum parameters. The auxiliary parameters are well-behaved by construction, so the problem is almost invariably in the posterior density. Keep in mind that for numerical reasons Stan works with the logarithm of the posterior density (also known as `log_prob`, `lp__` and `target`). The NUTS sampler performs several discrete steps per iteration and is guided by the gradient of the density. With some simplification, the sampler assumes that the log density is approximately linear at the current point, i.e. that a small change in parameters will result in a small change in log-density. This assumption is approximately correct if the step size is small enough. Let’s look at two different step sizes in a one-dimensional example:

The sampler starts at the red dot; the black line is the log-density, the magenta line its gradient. When moving 0.1 to the right, the sampler expects the log-density to decrease linearly (green triangle) and although the actual log-density decreases more (the green square), the difference is small. But when moving 0.4 to the right, the difference between the expected (blue cross) and actual (pink crossed square) values becomes much larger. It is a large discrepancy of this kind that is signalled as a divergence. During warmup, Stan will try to adjust the step size to be small enough for divergences not to occur, but large enough for sampling to be efficient. But if the parameter space is not well behaved, this might not be possible. Why? Keep on reading, fearless reader.
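The same comparison can be done numerically. A small Python sketch, using the standard normal log-density as a stand-in for the curve in the figure:

```python
# log-density of N(0, 1) up to a constant, and its gradient
lp = lambda x: -0.5 * x ** 2
grad = lambda x: -x

x0 = 1.0
for step in (0.1, 0.4):
    predicted = lp(x0) + grad(x0) * step  # linear extrapolation from x0
    actual = lp(x0 + step)
    print(step, predicted - actual)       # for this quadratic the error is step^2 / 2
```

The linear prediction is off by 0.005 for step 0.1 but by 0.08 for step 0.4 - quadrupling the step size makes the error sixteen times larger.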

Let’s try to build some geometric intuition in a 2D parameter space. Keep in mind that sampling is about exploring the parameter space proportionally to the associated posterior density - or, in other words - exploring uniformly across the volume between the zero plane and the surface defined by the density (probability mass). For simplicity, we will ignore the log transform Stan actually does and talk directly about density in the rest of this post. Imagine the posterior density is a smooth wide hill:

Stan starts each iteration by moving across the posterior in a random direction and then lets the density gradient steer the movement preferentially to areas with high density. To explore the hill efficiently, we need to take quite large steps in this process - the chain of samples will represent the posterior well if it can move across the whole posterior in a small-ish number of steps (actually at most `2^max_treedepth` steps). So an average step size of something like 0.1 might be reasonable here, as the posterior is approximately linear at this scale. We need to spend a bit more time around the center, but not that much, as there is a lot of volume also close to the edges - it has lower density, but it is a larger area.

Now imagine that the posterior is much sharper:

Now we need much smaller steps to explore safely. A step size of 0.1 won’t work, as the posterior is non-linear on this scale, which will result in divergences. The sampler is however able to adapt and chooses a smaller step size accordingly. Another thing Stan will do is rescale dimensions where the posterior is narrow. In the example above, the posterior is narrower in `y` and thus this dimension will be inflated to roughly match the spread in `x`. Keep in mind that Stan rescales each dimension separately (the posterior is transformed by a diagonal matrix).

Now what if the posterior is a combination of both a “smooth hill” and a “sharp mountain”?

The sampler should spend about half the time in the “sharp mountain” and the other half in the “smooth hill”, but those regions need different step sizes and the sampler only takes one step size. There is also no way to rescale the dimensions to compensate. A chain that adapted to the “smooth hill” region will experience divergences in the “sharp mountain” region, a chain that adapted to the “sharp mountain” will not move efficiently in the “smooth hill” region (which will be signalled as transitions exceeding maximum treedepth). The latter case is however less likely, as the “smooth hill” is larger and chains are more likely to start there. I *think* that this is why problems of this kind mostly manifest as divergences and less likely as exceeding maximum treedepth.

This is only one of many reasons why a multimodal posterior hurts sampling. Multimodality is problematic even if all modes are similar - one of the other problems is that traversing between modes might require a much larger step size than exploration within each mode, as in this example:

I bet Stan devs would add tons of other reasons why multimodality is bad for you (it really is), but I’ll stop here and move to other possible sources of divergences.

The posterior geometry may be problematic, even if it is unimodal. A typical example is a funnel, which often arises in multi-level models:

Here, the sampler should spend a lot of time near the peak (where it needs small steps), but a non-negligible volume is also found in the relatively low-density but large area on the right where a larger step size is required. Once again, there is no way to rescale each dimension independently to selectively “stretch” the area around the peak. Similar problems also arise with large regions of constant or almost constant density combined with a single mode.
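The textbook version of this geometry is Neal's funnel (a standard example, not necessarily the exact density plotted above): \(v \sim N(0, 3)\) and, conditionally, \(x \sim N(0, e^{v/2})\). A couple of lines of Python show how strongly the required step size in `x` depends on where the sampler currently sits in `v`:

```python
import math

# Conditional sd of x in Neal's funnel: x | v ~ N(0, exp(v / 2)).
def conditional_sd_x(v):
    return math.exp(v / 2)

sd_neck = conditional_sd_x(-6.0)   # deep in the narrow neck
sd_mouth = conditional_sd_x(6.0)   # out in the wide mouth
ratio = sd_mouth / sd_neck         # a several-hundred-fold difference
```

A single step size (even with per-dimension rescaling) cannot be right both in the neck and in the mouth at the same time.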

Last but not least, let’s look at tight correlation between variables, which is a different but frequent problem:

The problem is that when we are moving in the direction of the ridge, we need a large step size, but when we move perpendicular to that direction, we need a small step size. Once again, Stan is unable to rescale the posterior to compensate, as scaling `x` or `y` on its own will increase both the width and the length of the ridge.
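A toy calculation makes this concrete. For a bivariate normal with unit scales and correlation \(\rho\) (a stand-in for the ridge above, with a made-up \(\rho\)), the sd along the ridge is \(\sqrt{1+\rho}\) and across it \(\sqrt{1-\rho}\), and rescaling the axes separately leaves the correlation - and hence this aspect ratio - unchanged:

```python
import math

rho = 0.999  # hypothetical, strongly correlated toy posterior

# Eigenvalues of the covariance [[1, rho], [rho, 1]] are 1 + rho and
# 1 - rho, so the principal sds are their square roots:
sd_along = math.sqrt(1 + rho)    # along the ridge
sd_across = math.sqrt(1 - rho)   # across the ridge
aspect_ratio = sd_along / sd_across

# Rescaling x and y separately (what a diagonal metric does) changes
# the covariance, but not the correlation:
sx, sy = 2.0, 0.5
cov = [[sx * sx, sx * sy * rho],
       [sx * sy * rho, sy * sy]]
rho_rescaled = cov[0][1] / math.sqrt(cov[0][0] * cov[1][1])
```

For \(\rho = 0.999\) the aspect ratio is over 40, and no choice of `sx`, `sy` can reduce it.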

Things get even more insidious when the relationship between the two variables is not linear:

Here, a good step size is a function of both location (smaller near the peak) and direction (larger when following the spiral) making this kind of posterior hard to sample.
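The same point can be made numerically with a "banana"-shaped density (a common toy example, used here instead of the spiral): the local scales in the two principal directions, read off the Hessian of the log density, change dramatically as you move along the ridge:

```python
import math

# Toy "banana" density: log p(x, y) = -x^2/2 - (y - x^2)^2 / (2 * 0.01)
def log_p(x, y, s2=0.01):
    return -x * x / 2 - (y - x * x) ** 2 / (2 * s2)

# Finite-difference Hessian of log p at a point.
def hessian(x, y, h=1e-4):
    f = log_p
    dxx = (f(x + h, y) - 2 * f(x, y) + f(x - h, y)) / h ** 2
    dyy = (f(x, y + h) - 2 * f(x, y) + f(x, y - h)) / h ** 2
    dxy = (f(x + h, y + h) - f(x + h, y - h)
           - f(x - h, y + h) + f(x - h, y - h)) / (4 * h ** 2)
    return dxx, dxy, dyy

# Local scales in the two principal directions: 1 / sqrt(-eigenvalue)
# of the Hessian (eigenvalues of a 2x2 symmetric matrix in closed form).
def local_scales(x, y):
    a, b, c = hessian(x, y)
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(tr * tr - 4 * det)
    lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2
    return 1 / math.sqrt(-lam1), 1 / math.sqrt(-lam2)
```

At the origin the two scales are roughly 1 and 0.1; further along the ridge, at (2, 4), they are roughly 4 and 0.02 - so both the preferred direction and the required step size depend on position, just as with the spiral.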

This has all been pretty vague and folksy. Remember, these examples are here just to provide intuition. To be 100% correct, you need to go to the NUTS paper and/or the Conceptual Introduction to HMC paper and delve into the math. The math is always correct.

In particular, all the above geometries **may** be difficult for NUTS, and seeing them in visualisations hints at possible issues, but they **may** also be handled just fine. In fact, I wouldn’t be surprised if Stan worked with almost anything in two dimensions. Weak linear correlations that form wide ridges are also - in my experience - quite likely to be sampled well, even in higher dimensions. The issues arise when regions of non-negligible density are very narrow in some directions and much wider in others, and rescaling each dimension individually won’t help. And finally, keep in mind that the posteriors we discussed are even more difficult for Gibbs and other older samplers - and Gibbs will not even let you know there was a problem.

The amazing thing about divergences is that what is essentially a numerical problem actually signals a wide array of possibly severe modelling problems. Be glad - few algorithms (in any area) have such a clear signal that things went wrong. This is also the reason why you should be suspicious about your results even when only a single divergence has been reported - you don’t know what is hiding in the parts of your posterior that are inaccessible with the current step size.

That’s all for now. Hope to see you in the future with examples of actual diverging Stan models.

```
launch_shinystan_nonblocking <- function(fit) {
  library(future)
  plan(multisession)
  future(
    launch_shinystan(fit) #You can replace this with any other Shiny app
  )
}
```

Hope that helps!

**The TLDR is:** For this model, Stan scales considerably better than JAGS, but still cannot scale to very large models. Also, for this model, Stan and INLA give almost the same results. It seems that Stan becomes useful only when your model cannot be coded in INLA.

Please let me know (via an issue on GitHub) should you find any error or anything else that should be included in this post. Also, if you run the experiment on a different machine and/or with a different seed, let me know the results.

Here are the original numbers from Kathryn’s blog:

N | kathryn_rjags | kathryn_rinla |
---|---|---|
100 | 30.394 | 0.383 |
500 | 142.532 | 1.243 |
5000 | 1714.468 | 5.768 |
25000 | 8610.32 | 30.077 |
100000 | got bored after 6 hours | 166.819 |

*Full source of this post is available at this blog’s Github repo. Keep in mind that installing RStan is unfortunately not as straightforward as running install.packages. Please consult https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started if you don’t have RStan already installed.*

The model we are interested in is a simple GLM with partial pooling of a random effect:

```
y_i ~ poisson(mu_i)
log(mu_i) = alpha + beta * x_i + nu_i
nu_i ~ normal(0, tau_nu)
```

Let’s set up our libraries.

```
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
library(INLA)
library(tidyverse)
set.seed(6619414)
```

The results are stored in files within the repository to let me rebuild the site with blogdown easily. Delete cache directory to force a complete rerun.

```
cache_dir = "_stan_vs_inla_cache/"
if(!dir.exists(cache_dir)) {
  dir.create(cache_dir)
}
```

Let’s start by simulating data

```
#The sizes of datasets to work with
N_values = c(100, 500, 5000, 25000)

data = list()
for(N in N_values) {
  x = rnorm(N, mean = 5, sd = 1)
  nu = rnorm(N, 0, 0.1)
  mu = exp(1 + 0.5 * x + nu)
  y = rpois(N, mu)
  data[[N]] = list(
    N = N,
    x = x,
    y = y
  )
}
```

Here is the model code in Stan (it is good practice to put it into a separate file, but I wanted to make this post self-contained). It is an almost one-to-one rewrite of the original JAGS code, with a few important changes:

- JAGS parametrizes the normal distribution via precision, Stan via standard deviation. The model converts the precision into an sd.
- I added the ability to explicitly set parameters of the prior distributions as data which is useful later in this post
- With multilevel models, Stan works waaaaaay better with the so-called non-centered parametrization. This means that instead of having `nu ~ N(0, nu_sigma), mu = alpha + beta * x + nu` we have `nu_normalized ~ N(0,1), mu = alpha + beta * x + nu_normalized * nu_sigma`. This gives exactly the same inferences, but results in a geometry that Stan can explore efficiently.
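That the two parametrizations define the same distribution for `nu` is easy to check by simulation. Here is a quick stand-alone sketch in Python (not Stan or R code; the group-level sd of 0.1 matches the simulated data above):

```python
import random
import statistics

random.seed(1)
nu_sigma = 0.1   # group-level sd, as in the simulated data
n = 100_000

# Centered: draw nu directly from N(0, nu_sigma).
centered = [random.gauss(0, nu_sigma) for _ in range(n)]

# Non-centered: draw a standard normal and scale it afterwards.
non_centered = [random.gauss(0, 1) * nu_sigma for _ in range(n)]

# Both are draws from the same N(0, nu_sigma) distribution:
sd_centered = statistics.stdev(centered)
sd_non_centered = statistics.stdev(non_centered)
```

What changes is the geometry the sampler sees: in the non-centered version, the sampled quantity `nu_normalized` is a priori independent of `nu_sigma`, which removes the funnel-shaped dependence between the two.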

There are also packages that let you specify common models (including this one) without writing Stan code, using syntax similar to R-INLA - check out rstanarm and brms. The latter is more flexible, while the former is easier to install, as it ships with pre-compiled models and can be installed simply with `install.packages`.

Note also that Stan developers would advise against a Gamma(0.01, 0.01) prior on the precision, in favor of a normal or Cauchy prior on the sd - see https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations.

```
model_code = "
data {
  int N;
  vector[N] x;
  int y[N];

  //Allowing to parametrize the priors (useful later)
  real alpha_prior_mean;
  real beta_prior_mean;
  real<lower=0> alpha_beta_prior_precision;
  real<lower=0> tau_nu_prior_shape;
  real<lower=0> tau_nu_prior_rate;
}

transformed data {
  //Stan parametrizes the normal distribution with sd, not precision
  real alpha_beta_prior_sigma = sqrt(1 / alpha_beta_prior_precision);
}

parameters {
  real alpha;
  real beta;
  vector[N] nu_normalized;
  real<lower=0> tau_nu;
}

model {
  real nu_sigma = sqrt(1 / tau_nu);
  vector[N] nu = nu_normalized * nu_sigma;

  //Taking advantage of Stan's implicit vectorization here
  nu_normalized ~ normal(0, 1);
  //The built-in poisson_log(x) === poisson(exp(x))
  y ~ poisson_log(alpha + beta * x + nu);

  alpha ~ normal(alpha_prior_mean, alpha_beta_prior_sigma);
  beta ~ normal(beta_prior_mean, alpha_beta_prior_sigma);
  tau_nu ~ gamma(tau_nu_prior_shape, tau_nu_prior_rate);
}

//Uncomment this to have the model generate mu values as well.
//Currently commented out as storing the samples of mu consumes
//a lot of memory for the big models.
//(nu_sigma is local to the model block, so it is recomputed here)
/*
generated quantities {
  vector[N] mu = exp(alpha + beta * x + nu_normalized * sqrt(1 / tau_nu));
}
*/
"
model = stan_model(model_code = model_code)
```

Below is the code to make the actual measurements. Some caveats:

- The compilation of the Stan model is not counted (you can avoid it with rstanarm and need to do it only once otherwise)
- There is some overhead in transferring the posterior samples from Stan to R. This overhead is non-negligible for the larger models, but you can get rid of it by storing the samples in a file and reading them separately. The overhead is not measured here.
- Stan took > 16 hours to converge for the largest data size (1e5) and then I had issues fitting the posterior samples into memory on my computer. Notably, R-INLA also crashed on my computer for this size. The largest size is thus excluded here, but I have to conclude that if you get bored after 6 hours, Stan is not practical for such a big model.
- I was not able to get rjags running in a reasonable amount of time, so I did not rerun the JAGS version of the model.

```
stan_times_file = paste0(cache_dir, "stan_times.csv")
stan_summary_file = paste0(cache_dir, "stan_summary.csv")

run_stan = TRUE
if(file.exists(stan_times_file) && file.exists(stan_summary_file)) {
  stan_times = read.csv(stan_times_file)
  stan_summary = read.csv(stan_summary_file)
  if(setequal(stan_times$N, N_values) && setequal(stan_summary$N, N_values)) {
    run_stan = FALSE
  }
}

if(run_stan) {
  stan_times_values = numeric(length(N_values))
  stan_summary_list = list()
  step = 1
  for(N in N_values) {
    data_stan = data[[N]]
    data_stan$alpha_prior_mean = 0
    data_stan$beta_prior_mean = 0
    data_stan$alpha_beta_prior_precision = 0.001
    data_stan$tau_nu_prior_shape = 0.01
    data_stan$tau_nu_prior_rate = 0.01

    fit = sampling(model, data = data_stan)

    stan_summary_list[[step]] =
      as.data.frame(
        rstan::summary(fit, pars = c("alpha", "beta", "tau_nu"))$summary
      ) %>% rownames_to_column("parameter")
    stan_summary_list[[step]]$N = N

    all_times = get_elapsed_time(fit)
    stan_times_values[step] = max(all_times[, "warmup"] + all_times[, "sample"])

    step = step + 1
  }
  stan_times = data.frame(N = N_values, stan_time = stan_times_values)
  stan_summary = do.call(rbind, stan_summary_list)
  write.csv(stan_times, stan_times_file, row.names = FALSE)
  write.csv(stan_summary, stan_summary_file, row.names = FALSE)
}
```

```
inla_times_file = paste0(cache_dir, "inla_times.csv")
inla_summary_file = paste0(cache_dir, "inla_summary.csv")

run_inla = TRUE
if(file.exists(inla_times_file) && file.exists(inla_summary_file)) {
  inla_times = read.csv(inla_times_file)
  inla_summary = read.csv(inla_summary_file)
  if(setequal(inla_times$N, N_values) && setequal(inla_summary$N, N_values)) {
    run_inla = FALSE
  }
}

if(run_inla) {
  inla_times_values = numeric(length(N_values))
  inla_summary_list = list()
  step = 1
  for(N in N_values) {
    nu = 1:N
    fit_inla = inla(y ~ x + f(nu, model = "iid"), family = c("poisson"),
                    data = data[[N]], control.predictor = list(link = 1))

    inla_times_values[step] = fit_inla$cpu.used["Total"]
    inla_summary_list[[step]] =
      rbind(fit_inla$summary.fixed %>% select(-kld),
            fit_inla$summary.hyperpar) %>%
      rownames_to_column("parameter")
    inla_summary_list[[step]]$N = N

    step = step + 1
  }
  inla_times = data.frame(N = N_values, inla_time = inla_times_values)
  inla_summary = do.call(rbind, inla_summary_list)
  write.csv(inla_times, inla_times_file, row.names = FALSE)
  write.csv(inla_summary, inla_summary_file, row.names = FALSE)
}
```

Here we see side-by-side comparisons of the inferences, and they seem pretty comparable between Stan and INLA:

```
library(knitr) #for kable
for(N_to_show in N_values) {
  print(kable(stan_summary %>% filter(N == N_to_show) %>%
                select(c("parameter", "mean", "sd")),
              caption = paste0("Stan results for N = ", N_to_show)))
  print(kable(inla_summary %>% filter(N == N_to_show) %>%
                select(c("parameter", "mean", "sd")),
              caption = paste0("INLA results for N = ", N_to_show)))
}
```

**Stan results for N = 100**

parameter | mean | sd |
---|---|---|
alpha | 1.013559 | 0.0989778 |
beta | 0.495539 | 0.0176988 |
tau_nu | 162.001608 | 82.7700473 |

**INLA results for N = 100**

parameter | mean | sd |
---|---|---|
(Intercept) | 1.009037e+00 | 9.15248e-02 |
x | 4.971302e-01 | 1.61486e-02 |
Precision for nu | 1.819654e+04 | 1.71676e+04 |

**Stan results for N = 500**

parameter | mean | sd |
---|---|---|
alpha | 1.0046284 | 0.0555134 |
beta | 0.4977522 | 0.0102697 |
tau_nu | 71.6301530 | 13.8264812 |

**INLA results for N = 500**

parameter | mean | sd |
---|---|---|
(Intercept) | 1.0053202 | 0.0538456 |
x | 0.4977124 | 0.0099593 |
Precision for nu | 77.3311793 | 16.0255430 |

**Stan results for N = 5000**

parameter | mean | sd |
---|---|---|
alpha | 1.009930 | 0.0159586 |
beta | 0.496859 | 0.0029250 |
tau_nu | 101.548580 | 7.4655716 |

**INLA results for N = 5000**

parameter | mean | sd |
---|---|---|
(Intercept) | 1.0099282 | 0.0155388 |
x | 0.4968718 | 0.0028618 |
Precision for nu | 103.1508773 | 7.6811740 |

**Stan results for N = 25000**

parameter | mean | sd |
---|---|---|
alpha | 0.9874707 | 0.0066864 |
beta | 0.5019566 | 0.0012195 |
tau_nu | 104.3599424 | 3.5391938 |

**INLA results for N = 25000**

parameter | mean | sd |
---|---|---|
(Intercept) | 0.9876218 | 0.0067978 |
x | 0.5019341 | 0.0012452 |
Precision for nu | 104.8948949 | 3.4415929 |

You can see that Stan maintains reasonable runtimes up to larger data sizes than JAGS did in the original blog post, but INLA is still way faster. Also, Kathryn probably got very lucky with her seed for N = 25,000, as her INLA run completed very quickly. In my (few) tests, INLA always took at least several minutes for N = 25,000. This may mean that Kathryn’s JAGS time for that size is also overly optimistic.

```
#Kathryn's timings, taken from the table above
kathryn_results = data.frame(N = c(100, 500, 5000, 25000),
                             kathryn_rjags = c(30.394, 142.532, 1714.468, 8610.32),
                             kathryn_rinla = c(0.383, 1.243, 5.768, 30.077))

my_results = merge.data.frame(inla_times, stan_times, by = "N")
kable(merge.data.frame(my_results, kathryn_results, by = "N"))
```

N | inla_time | stan_time | kathryn_rjags | kathryn_rinla |
---|---|---|---|---|
100 | 1.061742 | 1.885 | 30.394 | 0.383 |
500 | 1.401597 | 11.120 | 142.532 | 1.243 |
5000 | 10.608704 | 388.514 | 1714.468 | 5.768 |
25000 | 611.505543 | 5807.670 | 8610.32 | 30.077 |

You could obviously do multiple runs to reduce uncertainty etc., but this post has already taken up too much of my time, so that is left to others.

I also had a hunch that maybe INLA is less precise than Stan, but that turned out to be based on an error. Thus, without much commentary, I put here my code to test this. Basically, I modify the random data generator to actually draw the parameters from the priors (the priors are quite constrained, to provide values of alpha, beta and tau_nu similar to the original example). I then give both algorithms the knowledge of these priors. I compute both the difference between the true parameters and a point estimate (the mean), and the quantile of the posterior distribution at which the true parameter is found. If the algorithms give the best possible estimates, the distribution of such quantiles should be uniform over (0,1). It turns out INLA and Stan give almost exactly the same results for almost all runs, and the differences in quality are (for this particular model) negligible.
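The logic of this check can be sketched in a few lines of Python on a conjugate normal-normal model, where the exact posterior is available in closed form (a stand-in for the Poisson GLM; all names and numbers here are mine):

```python
import math
import random
import statistics

random.seed(42)

# Conjugate toy model: theta ~ N(0, 1), y_k | theta ~ N(theta, 1).
# With K observations the posterior is N(K * ybar / (K + 1), 1 / (K + 1)).
def posterior_quantile_of_truth(K=10):
    theta = random.gauss(0, 1)                      # draw the "truth" from the prior
    y = [random.gauss(theta, 1) for _ in range(K)]  # simulate data given the truth
    post_mean = sum(y) / (K + 1)
    post_sd = math.sqrt(1 / (K + 1))
    z = (theta - post_mean) / post_sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))   # Phi(z): posterior CDF at theta

qs = [posterior_quantile_of_truth() for _ in range(5000)]
mean_q = statistics.mean(qs)   # should be close to 0.5
```

If the posterior is computed correctly, the quantiles `qs` are uniform on (0, 1) - so their mean should be near 0.5 and their variance near 1/12. A miscalibrated approximation would show up as a visibly non-uniform histogram.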

```
test_precision = function(N) {
  rejects <- 0
  repeat {
    #Set the priors so that they generate similar parameters as in the example above
    alpha_beta_prior_precision = 5
    prior_sigma = sqrt(1 / alpha_beta_prior_precision)
    alpha_prior_mean = 1
    beta_prior_mean = 0.5
    alpha = rnorm(1, alpha_prior_mean, prior_sigma)
    beta = rnorm(1, beta_prior_mean, prior_sigma)

    tau_nu_prior_shape = 2
    tau_nu_prior_rate = 0.01
    tau_nu = rgamma(1, tau_nu_prior_shape, tau_nu_prior_rate)
    sigma_nu = sqrt(1 / tau_nu)

    x = rnorm(N, mean = 5, sd = 1)
    nu = rnorm(N, 0, sigma_nu)
    linear = alpha + beta * x + nu

    #Rejection sampling to avoid NAs and ill-posed problems
    if(max(linear) < 15) {
      mu = exp(linear)
      y = rpois(N, mu)
      if(mean(y == 0) < 0.7) {
        break;
      }
    }
    rejects = rejects + 1
  }
  #cat(rejects, "rejects\n")

  data = list(
    N = N,
    x = x,
    y = y
  )
  #cat("A:", alpha, "B:", beta, "T:", tau_nu, "\n")
  #print(linear)
  #print(data)

  #=============== Fit INLA
  nu = 1:N
  fit_inla = inla(y ~ x + f(nu, model = "iid",
                            hyper = list(theta = list(prior = "loggamma",
                                                      param = c(tau_nu_prior_shape, tau_nu_prior_rate)))),
                  family = c("poisson"),
                  control.fixed = list(mean = beta_prior_mean,
                                       mean.intercept = alpha_prior_mean,
                                       prec = alpha_beta_prior_precision,
                                       prec.intercept = alpha_beta_prior_precision
                  ),
                  data = data, control.predictor = list(link = 1)
  )

  time_inla = fit_inla$cpu.used["Total"]
  alpha_mean_diff_inla = fit_inla$summary.fixed["(Intercept)", "mean"] - alpha
  beta_mean_diff_inla = fit_inla$summary.fixed["x", "mean"] - beta
  tau_nu_mean_diff_inla = fit_inla$summary.hyperpar[, "mean"] - tau_nu
  alpha_q_inla = inla.pmarginal(alpha, fit_inla$marginals.fixed$`(Intercept)`)
  beta_q_inla = inla.pmarginal(beta, fit_inla$marginals.fixed$x)
  tau_nu_q_inla = inla.pmarginal(tau_nu, fit_inla$marginals.hyperpar$`Precision for nu`)

  #================ Fit Stan
  data_stan = data
  data_stan$alpha_prior_mean = alpha_prior_mean
  data_stan$beta_prior_mean = beta_prior_mean
  data_stan$alpha_beta_prior_precision = alpha_beta_prior_precision
  data_stan$tau_nu_prior_shape = tau_nu_prior_shape
  data_stan$tau_nu_prior_rate = tau_nu_prior_rate

  fit = sampling(model, data = data_stan, control = list(adapt_delta = 0.95))

  all_times = get_elapsed_time(fit)
  max_total_time_stan = max(all_times[, "warmup"] + all_times[, "sample"])

  samples = rstan::extract(fit, pars = c("alpha", "beta", "tau_nu"))
  alpha_mean_diff_stan = mean(samples$alpha) - alpha
  beta_mean_diff_stan = mean(samples$beta) - beta
  tau_nu_mean_diff_stan = mean(samples$tau_nu) - tau_nu
  alpha_q_stan = ecdf(samples$alpha)(alpha)
  beta_q_stan = ecdf(samples$beta)(beta)
  tau_nu_q_stan = ecdf(samples$tau_nu)(tau_nu)

  return(data.frame(time_rstan = max_total_time_stan,
                    time_rinla = time_inla,
                    alpha_mean_diff_stan = alpha_mean_diff_stan,
                    beta_mean_diff_stan = beta_mean_diff_stan,
                    tau_nu_mean_diff_stan = tau_nu_mean_diff_stan,
                    alpha_q_stan = alpha_q_stan,
                    beta_q_stan = beta_q_stan,
                    tau_nu_q_stan = tau_nu_q_stan,
                    alpha_mean_diff_inla = alpha_mean_diff_inla,
                    beta_mean_diff_inla = beta_mean_diff_inla,
                    tau_nu_mean_diff_inla = tau_nu_mean_diff_inla,
                    alpha_q_inla = alpha_q_inla,
                    beta_q_inla = beta_q_inla,
                    tau_nu_q_inla = tau_nu_q_inla
  ))
}
```

Now let's actually run the comparison. On some occasions, Stan does not converge; my best guess is that the data are somehow pathological, but I didn’t investigate thoroughly. You can see that the results for Stan and INLA are very similar, both as point estimates and as distributions of posterior quantiles. The accuracy of the INLA approximation is also, AFAIK, going to improve with more data.

```
library(skimr) #Uses skimr to summarize results easily

precision_results_file = paste0(cache_dir, "precision_results.csv")
if(file.exists(precision_results_file)) {
  results_precision_df = read.csv(precision_results_file)
} else {
  results_precision = list()
  for(i in 1:100) {
    results_precision[[i]] = test_precision(50)
  }
  results_precision_df = do.call(rbind, results_precision)
  write.csv(results_precision_df, precision_results_file, row.names = FALSE)
}

#Remove uninteresting skim statistics
skim_with(numeric = list(missing = NULL, complete = NULL, n = NULL))
skimmed = results_precision_df %>% select(-X) %>% skim()

#Now a hack to display skim histograms properly in the output:
skimmed_better = skimmed %>% rowwise() %>% mutate(formatted =
  if_else(stat == "hist",
          utf8ToInt(formatted) %>% as.character() %>% paste0("&#", ., ";", collapse = ""),
          formatted))
mostattributes(skimmed_better) = attributes(skimmed)

skimmed_better %>% kable(escape = FALSE)
```

Skim summary statistics

n obs: 100

n variables: 14

Variable type: numeric

variable | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|
alpha_mean_diff_inla | -0.0021 | 0.2 | -0.85 | -0.094 | 0.0023 | 0.095 | 0.53 | ▁▁▁▂▇▇▁▁ |
alpha_mean_diff_stan | -0.0033 | 0.2 | -0.84 | -0.097 | -0.00012 | 0.093 | 0.52 | ▁▁▁▂▇▇▁▂ |
alpha_q_inla | 0.5 | 0.29 | 0.00084 | 0.25 | 0.5 | 0.73 | 0.99 | ▅▇▇▆▇▆▆▇ |
alpha_q_stan | 0.5 | 0.28 | 0.001 | 0.26 | 0.5 | 0.73 | 0.99 | ▅▇▇▆▇▆▆▇ |
beta_mean_diff_inla | -0.00088 | 0.04 | -0.12 | -0.016 | -0.001 | 0.014 | 0.17 | ▁▁▃▇▂▁▁▁ |
beta_mean_diff_stan | -0.001 | 0.04 | -0.12 | -0.016 | -5e-04 | 0.014 | 0.16 | ▁▁▂▇▂▁▁▁ |
beta_q_inla | 0.51 | 0.28 | 0.0068 | 0.26 | 0.52 | 0.75 | 1 | ▆▆▅▆▇▅▆▆ |
beta_q_stan | 0.51 | 0.28 | 0.0065 | 0.27 | 0.51 | 0.75 | 1 | ▆▆▅▇▆▅▆▆ |
tau_nu_mean_diff_inla | 4.45 | 90.17 | -338.58 | -26.74 | 4.49 | 53.38 | 193 | ▁▁▁▂▅▇▃▂ |
tau_nu_mean_diff_stan | 5.21 | 90 | -339.89 | -24.62 | 4.29 | 54.48 | 191.94 | ▁▁▁▂▅▇▃▂ |
tau_nu_q_inla | 0.53 | 0.26 | 0.023 | 0.32 | 0.52 | 0.74 | 0.99 | ▃▅▆▆▇▆▅▅ |
tau_nu_q_stan | 0.53 | 0.26 | 0.021 | 0.32 | 0.53 | 0.75 | 0.99 | ▃▅▅▆▇▃▅▅ |
time_rinla | 0.97 | 0.093 | 0.86 | 0.91 | 0.93 | 0.98 | 1.32 | ▇▇▂▁▁▁▁▁ |
time_rstan | 1.79 | 1.4 | 0.55 | 0.89 | 1.45 | 2.09 | 10.04 | ▇▂▁▁▁▁▁▁ |

A quick disclaimer first: I am no expert in either language, and while I am starting to feel comfortable in Elm, I am undoubtedly a Rust beginner, so please correct me if I am doing an injustice to either of the languages.

Rust is a systems language which aims to compete with C++. Rust values performance, concurrency, and memory safety, and is not garbage-collected. Rust compiles to native binaries, not only for the major x86/x64 platforms but also for ARM and even certain ARM-based microcontrollers.

Elm is a language for web apps competing with Javascript in general and the virtual DOM frameworks in particular (e.g. ReactJS). Elm compiles to Javascript, is garbage collected and purely functional. Elm values simplicity and reliability.

Both languages are already usable for actual projects, but the ecosystems are still immature and the languages themselves are still evolving.

While I like both of the languages, I do not intend to limit this post to the positive sides and will also mention what are (to me) the pain points. I will start with the ideas the languages have in common, and will give more details about either language later.

The features described here are mostly nothing completely new and could be found in languages like OCaml, Haskell and F#. The interesting part is that Elm and Rust prove they are useful for quite diverse use-cases.

This is a small but very practical feature - I would say tagged unions are enums on steroids. Consider how often you have written something like:

```
enum AccountType {Savings, CreditCard};

//In real code please use Decimal types to represent money. Please.
class CreditParams {
  int creditLimit;
  ...
}

class Account {
  AccountType accountType;
  int balance;
  //only present for CreditCard, always null for Savings
  CreditParams creditParams;
}
```

This makes room for some sweet bugs, as your data model can represent a state that should be impossible (a savings account with non-null credit parameters, or a credit card account with null credit parameters). The programmer needs to take care that no manipulation of the `Account` object can lead to such a state, which may be non-trivial and error-prone. It also creates ambiguity - for example, there are multiple ways to get the credit limit of an account:

```
//Yes I know, this should be a class method
int getCreditLimit1(Account account) {
  if (account.creditParams != null) {
    //wrong if account.accountType == Savings
    return account.creditParams.creditLimit;
  } else {
    return 0;
  }
}

int getCreditLimit2(Account account) {
  if (account.accountType == CreditCard) {
    //possibly accessing a null pointer
    return account.creditParams.creditLimit;
  } else {
    return 0;
  }
}
```

A more desirable option is to make impossible states impossible. Tagged unions let you do this by attaching heterogeneous data to each variant. This lets us rewrite the data model as (Rust syntax, try it online):

```
struct CreditParams {
    credit_limit: i32, //i32 is a 32bit signed int
    ...
}

enum AccountDetails {
    Savings,                  //Savings has no attached data
    CreditCard(CreditParams), //CreditCard has a single CreditParams instance
}

struct Account {
    balance: i32,
    details: AccountDetails,
}
```

With tagged unions, you cannot access the attached data without explicitly checking the type - so there is only one way to get the credit limit and it is always correct (Rust syntax, try it online):

```
fn get_credit_limit(account: Account) -> i32 {
    match account.details { //match is like case
        AccountDetails::CreditCard(params) => //bind local variable params to the attached data
            params.credit_limit, //in Rust, return is implicit
        AccountDetails::Savings =>
            0,
    }
}
```

Since neither Elm nor Rust has null values, you have to specify `CreditParams` when building an `AccountDetails` instance, and so the code above is safe in all situations.

A further bonus is that in both Elm and Rust, you have to handle all possible cases of a tagged union (or provide a default branch). Failing to handle all cases is a compile-time error. In this way, the compiler makes sure that you update all your code when you extend `AccountDetails`.

Some people are fond of static typing, as it is harder to write erroneous code in statically-typed languages. Some people like dynamic typing, because it avoids the bureaucracy of adding type annotations to everything. Type inference tries to get the best of both worlds: the language is statically typed, but you rarely need to provide type annotations. Type inference in Rust and Elm works a bit like `auto` in C++, but it is much more powerful - it looks at a broader context and also takes downstream code into consideration. So, for example (Rust syntax, try it online):

```
// The compiler infers that elem is a float.
let elem = 1.36;
//Explicit type annotation - f64 is a double precision float
let elem2: f64 = 3.141592;
// Create an empty vector (a growable array).
let mut vec = Vec::new();
// At this point the compiler doesn't know the exact type of `vec`, it
// just knows that it's a vector of something (`Vec<_>`).
// Insert `elem` and `elem2` in the vector.
vec.push(elem);
vec.push(elem2);
// Aha! Now the compiler knows that `vec` is a vector of doubles (`Vec<f64>`)
//The compiler infers that s is a &str (reference to string)
let s = "Hello";
//Compile-time error: expected floating-point variable, found &str
vec.push(s);
```

Type inference in Rust has certain limitations, and so explicit type annotations are still needed now and then. But Elm goes further, implementing a variant of the Hindley-Milner type system. In practice this means that type annotations in Elm are basically just comments (except for some weird corner cases). While the Elm compiler enforces that type annotations match the code, they can be omitted and the compiler will still statically typecheck everything. Nevertheless, it is advisable to annotate your functions with types, as type annotations let the compiler give you better error messages and force you to articulate your intent clearly.

Immutability means that variables/data cannot be modified after initial assignment/creation. Another way to state it is that operations on immutable data can have no observable effect except for returning a value. This implies that functions on immutable data will always return the same value for the same arguments. Code working with immutable data is easier to understand and reason about and is inherently thread-safe. Consider this code with mutable data:

```
address = new Address();
address.street = "Mullholland Drive";
...
person = new Person();
person.primaryAddress = address;
print(person.primaryAddress.street) //Mullholland Drive
...
address.street = "Park Avenue"
...
print(person.primaryAddress.street) //Park Avenue
```

Now let’s say we want to figure out why `person.primaryAddress.street` changed. Since the data is mutable, it is not sufficient to find all usages of `person.primaryAddress` - we also need to check the whole tree of variables/fields that were assigned to/from `person.primaryAddress`. With immutable data structures this problem is prevented, as the programmer is forced to write something like:

```
address = new Address("Mullholland Drive", 1035, "California");
//Elm and Rust also support syntax of the form:
//address = { street = "Mullholland Drive", number = 1035, state = "California" }
...
person = new Person(address);
...
address.street = "Park Avenue" //not allowed, the object is immutable
```

For a more detailed discussion of why immutability is good, see for example 3 benefits of using Immutable Objects.

Elm goes all-in on immutability - everything is immutable and no function can have a side effect. Rust is a bit more relaxed: in Rust, you have to opt-in for mutability and the compiler ensures that as long as a piece of data can be changed within a code segment (there is a mutable reference to the data), no other code path can read or modify the same data.

Making sure that the data you are referencing cannot change without your cooperation generally makes your life easier. Unless this is EXACTLY what you want to achieve. Let’s say you are writing a traffic monitoring tool. You might want to model your data like this (Elm syntax):

```
-- In Elm, double dash marks a comment
type alias City = --Curly braces declare a record type, a bit like an object
  { name: String
  , routes: List Route --list of Route instances
  }

type alias Route =
  { from: City
  , to: City
  , trafficLevel: Float
  }

type alias World =
  { cities: List City
  , routes: List Route
  }
```

You may expect that when you receive new traffic information, you simply update `World.routes` and the changes will be visible when accessing routes through `City.routes`. But you would be mistaken. In Elm, this will not even compile (fields in record types are fully expanded at compile time, and thus cannot have circular references). And if you use tagged unions to make the model compile, the `trafficLevel` accessed via `World.routes` may not be the same as the one accessed via `City.routes`, as those always behave as different instances.

A similar data model in Rust will compile, but it will be difficult to actually instantiate the structure, and you won’t ever be able to modify the `trafficLevel` of any `Route` instance, because the compiler won’t let you create a mutable reference to it (every `Route` is referenced at least twice).

This brings us to a less talked-about implication of immutability: **immutable data structures are inherently tree-like**. In both Elm and Rust, it is a pain to work with graph-like structures and you have to give up some guarantees the languages give you.

In Elm, the only way to represent a graph is by using indices to a dictionary (map) instead of direct references. For the above example a practical data model could look like:

```
type alias RouteId = Int -- New types just for clarity
type alias CityId = Int

type alias City =
  { id: CityId
  , name: String
  , routes: List RouteId
  }

type alias Route =
  { id: RouteId
  , from: CityId
  , to: CityId
  , trafficLevel: Float
  }

type alias World =
  { cities: Dict CityId City --dictionary (map) with CityId as keys and City as values
  , routes: Dict RouteId Route
  }
```

Notice that nothing prevents us from having an invalid `RouteId` stored in `City.routes`. While Elm gives you good tools to work with such a model (e.g., it forces you to always handle the case where a given `RouteId` is not present in `World.routes`), and the advantages for every other use case make this an acceptable cost, it is still a bit annoying.

Rust has a few more options for working with graph-like data, but they all have downsides of their own (using indices, StackOverflow discussion, graphs using ref counting or arena allocation).

This is basically a generalization of the previous specific features. The compilers for Elm and Rust are powerful and they do a lot of stuff for you. They not only parse the code line-by-line, but they reason about your code in the context of the whole program. However, the most interesting thing about the compilers for Rust and Elm is not what they let you do. It is what they DO NOT let you do (e.g., you cannot mix floats and ints without explicit conversion, you cannot get to the data stored in a tagged union without handling all possible cases, you cannot modify certain data, etc.). At the same time, the compilers are smart enough to make conforming to these restrictions less of a chore. If you think that programmers will produce better code when given fewer limitations, think of the times when people complained that restricting the use of GOTO would hinder productivity.

Another way to formulate this stance is that languages should not strive to make best practices easy as much as they should make writing bad code hard. I think both languages achieve this to a good degree - writing *any* code is a bit harder than in their less restrictive relatives, but there is much less incentive to take shortcuts.

In practice, smart but restrictive compilers mean more time spent coding and less time spent debugging. Since debugging and reading messy code can be very time-consuming, this usually results in a net productivity gain. Personally, I love writing code, while debugging is often frustrating, so to me, this is a sweet deal.

Needless to say, all those restrictions make hacking one-off dirty solutions in Rust or Elm slightly annoying. But what code is truly one-off?

The communities of both Elm and Rust make a big push for consistent presentation of source code. At the very least, this reduces the need for lengthy project-specific style guidelines at every team using the language. To be specific, the Elm compiler enforces indentation for certain language constructs, does not allow tabs for indentation(!), and enforces that types begin with an upper-case letter while functions begin with a lower-case one. Further, there is elm-format, a community-endorsed source formatter.

In a similar vein, the Rust compiler gives warnings if you do not stick to the official naming conventions, and there is also a community-endorsed formatter, rustfmt.

Now is the time to talk about the languages individually, if you are still interested. We will take Elm first. Elm is a simple, small language: the complete syntax can be documented on a single page. Elm aims at people already using Javascript and strives for a low barrier to entry. Elm is currently at version 0.18 and new releases regularly bring backwards-incompatible changes (although official conversion tools are available). An interesting detail is that over the last few versions more syntax elements were removed than added, testifying to the focus on language simplicity.

Elm is purely functional. This means there are no variables in the classical sense; everything is a function. How does an application evolve over time if there are no variables? This is handled by The Elm Architecture (TEA). At the most simplistic level, an Elm application consists primarily of an `update` function and a `view` function. The `update` function takes the previous state of the application and input from the user/environment and returns a new state of the application. The `view` function then takes the state of the application and returns an HTML representation. All changes to the application state thus happen outside of Elm code, within the native code in TEA. The architecture also provides the necessary magic to correctly and efficiently update the DOM to match the latest `view` result.
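The shape of TEA can be sketched in a few lines. This is my own minimal illustration (in Rust rather than Elm, and with a plain string standing in for the DOM), not real TEA code:

```rust
#[derive(Clone, Copy)]
struct Model {
    count: i32,
}

enum Msg {
    Increment,
    Decrement,
}

// update: previous state + input -> new state (the old state is not mutated)
fn update(model: Model, msg: Msg) -> Model {
    match msg {
        Msg::Increment => Model { count: model.count + 1 },
        Msg::Decrement => Model { count: model.count - 1 },
    }
}

// view: state -> "HTML" (a string stands in for the rendered DOM here)
fn view(model: Model) -> String {
    format!("<div>count = {}</div>", model.count)
}

fn main() {
    let mut model = Model { count: 0 };
    // The runtime (TEA's job) feeds messages through update and re-renders:
    for msg in [Msg::Increment, Msg::Increment, Msg::Decrement] {
        model = update(model, msg);
    }
    println!("{}", view(model)); // <div>count = 1</div>
}
```

Everything impure (event handling, DOM patching) lives in the runtime loop; `update` and `view` stay pure functions, which is what makes the state replayable.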

TEA forces you to explicitly say what constitutes the state of your application and its inputs. This lets Elm provide its killer feature: the time-travel debugger. In essence, when the debugger is turned on, you can replay the whole history of the application and inspect the application state at any point in the past. And due to the way the language is designed, it works 100% of the time.

Another big plus of TEA is that you never have to worry about forgetting to hide an element when the user clicks a checkbox. If your `view` function correctly displays the element based on the current application state, the element will also be automatically hidden once the application state changes again.

Further sweet things about Elm include the effort to have nice and helpful error messages, with a dedicated GitHub repository for suggesting error message improvements, and the record system, which gives you a lot of freedom in using structured types (e.g., you do not have to declare them before use) but is at the same time statically checked for correctness.

A big downside of TEA is that it assumes all of the application's state can be made explicit. This makes working with HTML elements that have a state of their own tricky in certain contexts (e.g., text area contents, caret position in text areas, Web Components). You need to take care to prevent TEA from messing with such components destructively. Further, TEA can be resource-intensive, albeit less so than comparable JS frameworks. Last but not least, creating large apps in Elm involves writing a significant amount of boilerplate code. The Elm community is still discussing how to develop large projects more easily.

> Whoa, that’s a lot of new syntax!
>
> (Rust book, section 4.34 on Macros)

In comparison with Elm, Rust is quite the beast. There is a lot of syntax and a lot of things to learn. This is, however, not unexpected: if you want to write fast code, you really need a lot of control. Also, C and especially C++ have loads of syntax themselves, so Rust is definitely not at a big disadvantage here. Rust is currently at version 1.15 and has backwards-compatibility guarantees.

While Rust is imperative, it took in a lot of useful functional programming concepts and boasts zero-cost abstractions, i.e., all the fancy syntactic tricks that let you develop code easily incur no performance penalty in comparison with a hand-tuned but dirty solution.
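The canonical illustration of a zero-cost abstraction is the iterator chain: the two functions below (my own example) express the same computation, and the optimizer compiles the high-level chain down to essentially the same machine code as the manual loop.

```rust
// High-level style: filter/map/sum over an iterator.
fn sum_of_even_squares_iter(xs: &[i64]) -> i64 {
    xs.iter().filter(|&&x| x % 2 == 0).map(|&x| x * x).sum()
}

// Hand-written style: an explicit loop with an accumulator.
fn sum_of_even_squares_loop(xs: &[i64]) -> i64 {
    let mut total = 0;
    for &x in xs {
        if x % 2 == 0 {
            total += x * x;
        }
    }
    total
}

fn main() {
    let xs = [1, 2, 3, 4, 5];
    // Same result; with optimizations on, essentially the same generated code.
    assert_eq!(sum_of_even_squares_iter(&xs), sum_of_even_squares_loop(&xs));
    println!("{}", sum_of_even_squares_iter(&xs)); // 20
}
```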

Rust also has no OOP of the usual kind; instead it has traits (a bit like interfaces) and deliberately avoids inheritance (you should compose instead).
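A small sketch of my own showing what this looks like in practice: shared behaviour comes from a trait, while reuse comes from composition rather than a class hierarchy.

```rust
// Shared behaviour lives in a trait, a bit like an interface.
trait Describe {
    fn describe(&self) -> String;
}

struct Engine {
    horsepower: u32,
}

// Composition: a Car *has an* Engine rather than inheriting from anything.
struct Car {
    name: String,
    engine: Engine,
}

impl Describe for Car {
    fn describe(&self) -> String {
        format!("{} ({} hp)", self.name, self.engine.horsepower)
    }
}

fn main() {
    let car = Car { name: "Trabant".into(), engine: Engine { horsepower: 26 } };
    println!("{}", car.describe()); // Trabant (26 hp)
}
```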

The weirdest and most interesting part of Rust is the borrow checker. While Rust does not have managed memory (garbage collection), it can still guarantee that you cannot access uninitialized memory, dereference a null pointer or otherwise corrupt your memory. This has big implications not only for reliability but also for security, as Rust automatically prevents whole classes of severe attacks such as buffer overflows or Heartbleed (blog post). Rust also prevents most (but not all) memory leaks. The borrow checker enables a big portion of those guarantees by validating at *compile time* that your program accesses memory correctly, i.e. without the runtime penalty of managed memory. The borrow checker ensures that a mutable reference to a piece of data cannot coexist with any other reference (and thus that you cannot free memory while holding a reference to it). For some intuition, mutable references in Rust behave a bit like `std::unique_ptr` in C++ (specs), but with the uniqueness enforced at compile time. A more detailed description would not fit here, so check Rust by Example or just Google away :-).
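The exclusivity rule can be seen in a few lines (my own example; the violating lines are left as comments, since they would not compile):

```rust
fn main() {
    let mut data = vec![1, 2, 3];

    {
        let m = &mut data; // exclusive: no other reference may exist here
        m.push(4);
    } // the mutable borrow ends here, so shared borrows become legal again

    let a = &data;
    let b = &data; // any number of *shared* references may coexist
    assert_eq!(a.len(), 4);
    assert_eq!(b[3], 4);

    // let m = &mut data; println!("{:?}", a); // would NOT compile:
    // cannot borrow `data` as mutable while the shared borrow `a` is in use

    println!("{:?}", data); // [1, 2, 3, 4]
}
```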

The borrow checker is both the biggest strength and the biggest weakness of Rust. Although the Rust community has put a lot of effort into making most code just work, you inevitably end up fighting the borrow checker. There are some promising updates to the borrow checker in the pipeline that could make the life of a Rust programmer easier, but it will not be a cakewalk anytime soon - making the compiler understand your program is hard (both for the programmer and for the compiler).

While Rust takes performance seriously and the compiler should *in theory* be able to do a lot more optimizations than C/C++ compilers, Rust is not quite there yet. Benchmarks I’ve seen put it equal to or slightly behind C/C++ compiled with gcc (e.g., the Benchmarks Game). From memory, gcc also used to produce slower code than MSVC or the Intel compiler, which would be bad news for Rust. The Internet, however, suggests that recent gcc is on par with MSVC/Intel, but I was unable to find a good benchmark link.

Development in Rust also still has some rough edges: IDE support is incomplete - setting up a decent debug environment may be as much as a 14-step process, and even then the features are limited.

The same way functional programming made its way from the fringes into mainstream languages, I believe the features that make both Elm and Rust interesting will show up in the mainstream. Some of the ideas can also be transferred to current languages immediately (e.g., ImmutableJS). I think the take-home message of this post is that you should consider learning a new language, preferably one that is very different from what you have been working with so far. Not only is it fun, it will also make you a better programmer in your language of choice.

I’ll be very happy if you provide your feedback on this post either here, on my Twitter or on Reddit.

I also do a bit of statistical consulting on other people’s projects. Get in touch!

I also used to do research in AI for computer games. And I have a defunct software engineering blog.

E-mail (personal): modrak.mar@gmail.com

Twitter: @modrak_m

- Developing and debugging statistical models (particularly Stan and INLA)
- Exploratory data analysis
- Statistical graphics

If I have time, I will answer small inquiries on the Stan Discourse; tag me as @martinmodrak to get my attention. You can check out a list of my previous answers.

I am privileged to have a lot of freedom in what I do in research. So within the academic sphere, I am generally open to more substantial consulting in exchange for co-authorship; details depend on the project. You can check out my published research output on my ORCID; some not officially published work is also at Zenodo.

Commercial and NGO consulting is available on a case-by-case basis, though I have less experience in this area. Currently, I am helping the Czech Scout Movement use insights from data to better harness and grow the potential of its young adult members.
