Excercise 3Martin Modrák2021/08/24 (updated: 2022-01-26)1 / 13

A detective story

We have a trustworthy generator of data for linear regression and 3 suspicious implementations in Stan. Can you tell which is telling the truth? Maybe all are wrong? Maybe all are correct?

2 / 13

Generator

It is often useful to implement the generator code in the most simple and stupid way to reduce the probability of errors. Keep all the fancy optimizations for your Stan code!

We promise the code below is correct. N is the number of data points, K the number of predictors.

single_sim_regression <- function(N, K) {
  x <- matrix(rnorm(n = N * K, mean = 0, sd = 1), 
              nrow = N, ncol = K)
  alpha <- rnorm(n = 1, mean = 0, sd = 1)
  beta <- rnorm(n = K, mean = 0, sd = 1)
  sigma <- abs(rnorm(n = 1, mean = 0, sd = 2))
  y <- array(NA_real_, N)
  for(n in 1:N) {
    mu <- alpha
    for(k in 1:K) {
      mu <- mu + x[n,k] * beta[k]
    }
    y[n] <- rnorm(n = 1, mean = mu, sd = sigma) 
  }
  list(
    variables = list(
      alpha = alpha,
      beta = beta,
      sigma = sigma),
    generated = list(
      N = N, K = K,
      x = x, y = y
    )
  )
}

3 / 13

Task 1/2

Build the generator and run simulations. Once again, 10 simulations should serve us well.

generator_regression <- SBC_generator_function(
               single_sim_regression, N = 100, K = 2)
datasets_regression <- generate_datasets(
               generator_regression, 10)

4 / 13

Suspect 1

Already saved in regression_1.stan if you ran the setup script, otherwise store it in this file yourself :-)

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  vector[N] mu = rep_vector(alpha, N);
  for(i in 1:K) {
    for(j in 1:N) {
      mu[j] += beta[i] * x[j, i];
    }
  }
  y ~ normal(mu, sigma);  // likelihood
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 2);
}

5 / 13

Suspect 2

Store in a file called regression_2.stan.

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  vector[N] mu;
  for(i in 1:N) {
    mu[i] = alpha;
    for(j in 1:K) {
      mu[i] += beta[j] * x[j, j];
    }
  }
  y ~ normal(mu, sigma);  // likelihood
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 2);
}

6 / 13

Suspect 3

Store in a file called regression_3.stan.

data {
  int<lower=0> N;   // number of data items
  int<lower=0> K;   // number of predictors
  matrix[N, K] x;   // predictor matrix
  vector[N] y;      // outcome vector
}
parameters {
  real alpha;           // intercept
  vector[K] beta;       // coefficients for predictors
  real<lower=0> sigma;  // error scale
}
model {
  y ~ normal(transpose(beta) * transpose(x) + alpha, sigma);  // likelihood
  alpha ~ normal(0, 5);
  beta ~ normal(0, 1);
  sigma ~ normal(0, 2);
}

7 / 13

Question 1/2

Make a guess - which of the suspects you believe implement linear regression correctly? (could be all three) Which of the suspects will not give correct results? (could be all three).

Don't spend too much time investigating, if you don't see it - SBC is going to help us!

8 / 13

Task 2/2 - Run SBC

Backend with cmdstanr

  model_regression_1 <- cmdstan_model("regression_1.stan")
  backend_regression_1 <- SBC_backend_cmdstan_sample(
      model_regression_1, 
      iter_warmup = 400, iter_sampling = 500)

Backend with rstan

  model_regression_1 <- stan_model("regression_1.stan")
  backend_regression_1 <- SBC_backend_rstan_sample(
      model_regression_1, 
       iter = 900, warmup = 400)

SBC + plots

results_regression_1 <- 
  compute_SBC(datasets_regression, backend_regression_1)
plot_rank_hist(results_regression_1)
plot_ecdf(results_regression_1)
plot_ecdf_diff(results_regression_1)

(Repeat for all 3 suspects)

9 / 13

Question 2/2

Which of the suspects were caught misleading us using SBC? Which variables are affected the most and in which direction?

10 / 13

Bonus Task 1

Figure out what went wrong in the problematic model(s). Can you fix the problem(s)?

There is a hint on the next slide.

11 / 13

Bonus Task 1 - Hint

Indexing and for loops are a frequent source of typos. Some letters just look quite similar.

12 / 13

Bonus Task 2

Can you explain why the problematic model(s) produced the SBC plots they did?

Bonus Task 3

Look at results_regression_X$stats for the problematic models. Would it be possible to use z_score to diagnose the problem faster?

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

Excercise 3

Martin Modrák

2021/08/24 (updated: 2022-01-26)

A detective story

Generator

Task 1/2

Suspect 1

Suspect 2

Suspect 3

Question 1/2

Task 2/2 - Run SBC

Question 2/2

Bonus Task 1

Bonus Task 1 - Hint

Bonus Task 2

Bonus Task 3

A detective story

Help