+ - 0:00:00
Notes for current slide
Notes for next slide

Excercise 3

Martin Modrák

2021/08/24 (updated: 2022-01-26)

1 / 13

A detective story

We have a trustworthy generator of data for linear regression and 3 suspicious implementations in Stan. Can you tell which is telling the truth? Maybe all are wrong? Maybe all are correct?

2 / 13

Generator

It is often useful to implement the generator code in the most simple and stupid way to reduce the probability of errors. Keep all the fancy optimizations for your Stan code!

We promise the code below is correct. N is the number of data points, K the number of predictors.

single_sim_regression <- function(N, K) {
x <- matrix(rnorm(n = N * K, mean = 0, sd = 1),
nrow = N, ncol = K)
alpha <- rnorm(n = 1, mean = 0, sd = 1)
beta <- rnorm(n = K, mean = 0, sd = 1)
sigma <- abs(rnorm(n = 1, mean = 0, sd = 2))
y <- array(NA_real_, N)
for(n in 1:N) {
mu <- alpha
for(k in 1:K) {
mu <- mu + x[n,k] * beta[k]
}
y[n] <- rnorm(n = 1, mean = mu, sd = sigma)
}
list(
variables = list(
alpha = alpha,
beta = beta,
sigma = sigma),
generated = list(
N = N, K = K,
x = x, y = y
)
)
}
3 / 13

Task 1/2

Build the generator and run simulations. Once again, 10 simulations should serve us well.

generator_regression <- SBC_generator_function(
single_sim_regression, N = 100, K = 2)
datasets_regression <- generate_datasets(
generator_regression, 10)
4 / 13

Suspect 1

Already saved in regression_1.stan if you ran the setup script, otherwise store it in this file yourself :-)

data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
vector[N] mu = rep_vector(alpha, N);
for(i in 1:K) {
for(j in 1:N) {
mu[j] += beta[i] * x[j, i];
}
}
y ~ normal(mu, sigma); // likelihood
alpha ~ normal(0, 5);
beta ~ normal(0, 1);
sigma ~ normal(0, 2);
}
5 / 13

Suspect 2

Store in a file called regression_2.stan.

data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
vector[N] mu;
for(i in 1:N) {
mu[i] = alpha;
for(j in 1:K) {
mu[i] += beta[j] * x[j, j];
}
}
y ~ normal(mu, sigma); // likelihood
alpha ~ normal(0, 5);
beta ~ normal(0, 1);
sigma ~ normal(0, 2);
}
6 / 13

Suspect 3

Store in a file called regression_3.stan.

data {
int<lower=0> N; // number of data items
int<lower=0> K; // number of predictors
matrix[N, K] x; // predictor matrix
vector[N] y; // outcome vector
}
parameters {
real alpha; // intercept
vector[K] beta; // coefficients for predictors
real<lower=0> sigma; // error scale
}
model {
y ~ normal(transpose(beta) * transpose(x) + alpha, sigma); // likelihood
alpha ~ normal(0, 5);
beta ~ normal(0, 1);
sigma ~ normal(0, 2);
}
7 / 13

Question 1/2

Make a guess - which of the suspects you believe implement linear regression correctly? (could be all three) Which of the suspects will not give correct results? (could be all three).

Don't spend too much time investigating, if you don't see it - SBC is going to help us!

8 / 13

Task 2/2 - Run SBC

Backend with cmdstanr

model_regression_1 <- cmdstan_model("regression_1.stan")
backend_regression_1 <- SBC_backend_cmdstan_sample(
model_regression_1,
iter_warmup = 400, iter_sampling = 500)

Backend with rstan

model_regression_1 <- stan_model("regression_1.stan")
backend_regression_1 <- SBC_backend_rstan_sample(
model_regression_1,
iter = 900, warmup = 400)

SBC + plots

results_regression_1 <-
compute_SBC(datasets_regression, backend_regression_1)
plot_rank_hist(results_regression_1)
plot_ecdf(results_regression_1)
plot_ecdf_diff(results_regression_1)

(Repeat for all 3 suspects)

9 / 13

Question 2/2

Which of the suspects were caught misleading us using SBC? Which variables are affected the most and in which direction?

10 / 13

Bonus Task 1

Figure out what went wrong in the problematic model(s). Can you fix the problem(s)?

There is a hint on the next slide.

11 / 13

Bonus Task 1 - Hint

Indexing and for loops are a frequent source of typos. Some letters just look quite similar.

12 / 13

Bonus Task 2

Can you explain why the problematic model(s) produced the SBC plots they did?

Bonus Task 3

Look at results_regression_X$stats for the problematic models. Would it be possible to use z_score to diagnose the problem faster?

13 / 13

A detective story

We have a trustworthy generator of data for linear regression and 3 suspicious implementations in Stan. Can you tell which is telling the truth? Maybe all are wrong? Maybe all are correct?

2 / 13
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow