--- title: "Crash Course On Bayesian Regression Modelling" output: bookdown::html_document2: base_format: rmarkdown::html_vignette fig_width: 5 fig_height: 3 toc: true link-citations: yes pkgdown: as_is: true vignette: > %\VignetteIndexEntry{Crash Course On Bayesian Regression Modelling} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} bibliography: references.bib --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup} #| message = FALSE ## Start by loading the required packages library(bayesplot) library(broom) library(crashbayes) library(cowplot) library(distributional) library(dplyr) library(forcats) library(ggplot2) library(ggtext) library(kableExtra) library(rstanarm) library(tibble) library(tidyr) library(tidybayes) ## Enable parallelisation in order to leverage multiple cores options(mc.cores = parallel::detectCores()) ``` ```{r echo = FALSE} ## Set a few ggplot style parameters for figures theme_set( theme_tidybayes() + theme( panel.border = element_rect(color = 'grey85', fill = NA, linetype = 1, linewidth = 1) ) ) # Set the RNG so we all get the same pseudo-random values set.seed(20230305) ``` # Introduction ## Requirements In this tutorial I will assume that you have at least: 1. a working knowledge of the R essentials I will be using the `tidyverse`, in particular `ggplot2`, because I find them expressive and easy to understand. But they are not essential to follow the training correcly. 2. basic training in statistical regression modelling, specifically (generalized) linear models We will also consider hierarchical (aka _mixed effcts_, or _multilevel_) models, (generalized) additive models, ordinal models and others. But all those are extensions of the former for specific situations and you will surely be able to at least get the gist of it. You can sort out the details later. In particular, I don't expect you to have any prior knowledge about Bayesian statistics. Perhaps you heard about it, and you have some motivation to learn more. You may have the preconception that Bayesian inference is __harder__ that classical frequentist inference. There are those weird things called _priors_ that you need to specify beforehand and who knows what they do, there are Monte Carlo Markov Chain (MCMC) simulations that take forever and you need to diagnose and validate __in addition__ to the regular diagnosis and validation of the model itself (!). And then, you do not even get any miserable _p_-value that help you classify effects properly. ## Objectives All what follows intends to address the above concerns and demonstrate that: 1. Bayesian inference is __not necessarily difficult__ to implement. Nowadays there are awesome R-packages that make the most common regression analyses very straightforward. As easy and fast as any classical frequentist regression. 2. Yes, _priors_ can be tricky. But in most cases there are suitable sensible defaults. It is still important that you understand their role, because they can prove __extremely useful__ and give you a super-power. 3. Okay, _MCMC_ can also get tricky. But again, in most common cases everything works well and smooth. If it does not, there is usually a problem with the model or the data that would have caused troubles under a frequentist approach as well (whether you notice it or not). 4. Most times, _p_-values cause more harm than good. Their interpretation is very tricky and people tend to draw all sorts of incorrect conclusions from them. 
In contrast, __posterior distributions__ are way more informative, and tell you exactly what you typically want to know. And yes, you can publish papers without _p_-values.

## Software

There is an overwhelming number of [R packages for Bayesian inference](https://cran.r-project.org/web/views/Bayesian.html) on CRAN alone. I have based this training on [`rstanarm`](http://mc-stan.org/rstanarm/) [@goodrichRstanarmBayesianApplied2020] because I find its user interface particularly natural for transitioning from classic modelling R packages. It uses [Stan](https://mc-stan.org/) in the background, which is one of the most advanced [probabilistic programming languages](https://en.wikipedia.org/wiki/Probabilistic_programming) available. Moreover, it works very well in combination with [`bayesplot`](https://mc-stan.org/bayesplot/) [@gabryBayesplotPlottingBayesian2022], a package for graphical exploration of Bayesian models.

# Hello (Bayes) World!

Let's simulate some data from a Linear Model

\begin{align}
  Y & \sim \mathcal{N}(\mu, \sigma) \\
  \mu & = \beta_0 + \beta_x X
  (\#eq:lm1)
\end{align}

```{r dat-simple-regression}
beta <- c(1, -1)  # "True" values of the regression coefficients
dat <- tibble(
  x = runif(50, -1, 1),  # 50 observed values of covariate X, within (-1, 1)
  y = beta[1] + beta[2] * x + rnorm(50)  # observations of the response
)
```

```{r dat-plot, fig.cap = cap, echo = FALSE}
cap <- paste(
  "Simulated data with the true underlying relationship in orange."
)
dat |>
  ggplot() +
  aes(x, y) +
  geom_point() +
  geom_abline(intercept = beta[1], slope = beta[2], colour = "darkorange")
```

## Linear model

Fit the Linear Model \@ref(eq:lm1) to the simulated data `dat` with:

```{r fm-hw}
#| results = 'hide'
fm_hw <- rstanarm::stan_glm(
  y ~ x,
  data = dat
)
```

```{r fm-hw-stan-lm}
#| eval = FALSE,
#| include = FALSE
## Equivalently, I could have used rstanarm's analogue of lm(),
## but it forces me to explicitly declare a prior (even to request a default),
## which I will introduce later.
fm_hw_stan_lm <- rstanarm::stan_lm(
  y ~ x,
  data = dat,
  prior = NULL  # Use default priors
)
```

and obtain numerical and graphical summaries of the results using the generic R functions `summary()` and `plot()`:

```{r fm-summary}
summary(fm_hw)
```

```{r fm-plot}
# Equivalent to bayesplot::mcmc_intervals(fm_hw)
# See ?plot.stanreg for plotting options
plot(fm_hw)
```

__Congratulations!!!__ You have just fitted a Bayesian linear model!!

## Inspecting results

Let's fit the same model using a frequentist methodology. The code is essentially the same as before, just replacing the call to `rstanarm::stan_glm()` with `stats::lm()`[^1].

[^1]: The syntax `pkg::fun()` in R can be used to specify explicitly the package a function belongs to. The package part can be omitted, as I have done in the call to `lm()`.

```{r fm-hw-lm}
fm_hw_lm <- lm(
  y ~ x,
  data = dat
)
summary(fm_hw_lm)
```

```{r fm-hw-estimates}
#| echo: FALSE
fm_hw_estimates <-
  ## Bayesian estimates
  mcmc_intervals_data(fm_hw, point_est = "mean", prob_outer = .95) |>
  select(parameter, point_est = m, ll, hh) |>
  add_column(method = "Bayes") |>
  ## Frequentist estimates
  bind_rows(
    broom::tidy(fm_hw_lm, conf.int = TRUE) |>
      select(
        parameter = term,
        point_est = estimate,
        ll = conf.low,
        hh = conf.high
      ) |>
      ## I need to use a custom function to compute the CI for sigma, which
      ## is technically not a model parameter in the frequentist world and
      ## does not get any inferential results by default.
      bind_rows(
        sigma_ci(fm_hw_lm, alpha = 0.05)
      ) |>
      add_column(method = "Freq")
  ) |>
  mutate(parameter = forcats::fct_inorder(parameter))
```

```{r plot-estimates-hw}
#| warning = FALSE,
#| fig.cap = cap,
#| echo = FALSE
cap <- paste(
  "**Frequentist** and",
  "**Bayesian**",
  "point and interval estimates for the *Hello World* linear model."
)
ggplot(fm_hw_estimates) +
  aes(point_est, parameter, colour = method) +
  geom_pointrange(
    aes(xmin = ll, xmax = hh),
    position = position_dodge(width = 0.3),
    show.legend = FALSE
  ) +
  labs(
    x = "Estimate",
    y = "Parameter"
  ) +
  scale_colour_manual(
    values = c("darkorange", "darkgrey")
  )
```

The point and interval estimates of the model parameters ($\beta_0$, $\beta_x$ and $\sigma$) are compared in Figure \@ref(fig:plot-estimates-hw) [^2].

[^2]: The code for producing the figure involves using [`bayesplot::mcmc_intervals_data()`](https://mc-stan.org/bayesplot/reference/MCMC-intervals.html) and [`broom::tidy.lm()`](https://broom.tidymodels.org/reference/tidy.lm.html) for extracting results from the Bayesian and frequentist model fits and gathering everything together to perform the comparison. It is hidden because it is not the main focus of the training, but you can always [check the code](https://umr-astre.pages.mia.inra.fr/training/crashbayes/articles/crashbayes.R).

The numerical estimates from both approaches are almost identical in this example. However, this will not always be the case. Besides, their interpretation is totally different, as we will see later. To begin with, they are called differently:

```{r nomenclature}
#| echo: FALSE
tribble(
  ~Estimate, ~Frequentist, ~Bayesian,
  "Point", "Maximum-Likelihood", "Posterior mean (or median, or...)",
  "Interval", "Confidence Interval", "Credible Interval"
) |>
  kbl(
    caption = paste(
      "Nomenclature of point and interval estimates in the frequentist",
      "and the Bayesian frameworks."
    )
  ) |>
  kable_styling(
    full_width = FALSE
  )
```

## In summary

- We fitted a Bayesian linear model to some simple simulated data using the package `rstanarm`
- It took __nearly the same effort__ as the frequentist equivalent and gave __nearly the same numerical results__
- Yet, it seems that these results have a __different meaning__, and there is some __prior__ thing for which we used a default specification.
- Pending questions: why bother, if the numerical results are similar; what is the role of the prior; and how the interpretation of the results differs.

We will tackle the first two questions in the next section. But here is a teaser: numerical results are sometimes __very different__ from frequentist outcomes, and this has everything to do with the __priors__.

# Priors: contextualising data

Suppose that you are the head of a biological laboratory recently installed in a small village. While Covid's positivity rate is currently very low (< 5%) overall in the country, out of your first 3 tests, __2 were positive__ for SARS-CoV-2.
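Just to gauge how surprising this is (a quick side computation, not part of the original analysis): under a true positivity rate of 5%, observing two or more positives among three independent tests would be a rather improbable event.

```{r surprise-check}
## P(2 or more positives out of 3 tests | pi = 0.05):
## a quick plausibility check; gives about 0.007.
pbinom(1, size = 3, prob = 0.05, lower.tail = FALSE)
```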
```{r pcr-dat}
pcr_dat <- data.frame(y = c(1, 1, 0))
```

```{r plot-pcr-dat}
#| fig.cap = "2 out of 3 patients in the sample tested positive.",
#| fig.width = 4,
#| fig.height = 2,
#| echo = FALSE
pcr_dat |>
  bind_cols(
    expand.grid(
      r = seq_len(3),
      c = seq_len(1)
    )
  ) |>
  ggplot(aes(r, c)) +
  aes(col = factor(y)) +
  geom_point(size = 15, show.legend = FALSE) +
  scale_color_manual(values = c("grey", "darkorange")) +
  coord_fixed(clip = "off") +
  lims(x = c(0.5, 3.5), y = c(0.5, 1.5)) +
  theme_void()
```

Your assistant computes frequentist estimates for a proportion in order to estimate the positivity rate in the village and to _test_ whether the observed data is consistent with the hypothesis of a positivity rate of 5%:

```{r prob-test-pcr}
prop.test(x = sum(pcr_dat$y), n = nrow(pcr_dat), p = 0.05)
```

Horrified, he claims to have found that Covid's positivity rate in the village is very likely above 10% (95% CI: [12-98]) and certainly (p < 0.001) above the global positivity rate of 5%, and urges you to sound the epidemiological alarm.

__What would you do?__

More experienced, you calm him down and explain that the situation certainly deserves attention, but you need to collect more confirmatory evidence before panicking. Indeed,

- In a context of __low epidemic transmission__, a truly high positivity rate in a small village would be very unlikely.
- Instead, other factors appear as __more plausible explanations__ for the observations, even if they are relatively unlikely themselves. E.g.,
  - Malfunctioning equipment
  - Human mistakes in the manipulation of the samples or the results
  - Patients might be relatives or have recently gathered (i.e. not a _random_ sample of independent test-users)

In a way, you are __weighting__ the evidence in the observed data with your __prior knowledge__ about the epidemiological situation. In the current context, a positivity rate of 66% is very unlikely (i.e., incompatible with your prior expectations), and __extraordinary claims require extraordinary evidence__.

Bayesian statistics allows you to embed these considerations into the model, and to produce estimates __given the available data and your prior expectations__.

[[Diagnostic tests example](https://umr-astre.pages.mia.inra.fr/training/notions_stats/slides/S8.1_ICs-NHST.html#29)]

## Logistic model

The model used in this case is a simple extension of the linear model \@ref(eq:lm1) (i.e. a _Generalized_ Linear Model, or GLM) where we replace the Gaussian likelihood with a Binomial one and there are no regression variables in the linear predictor. Let $Y$ be the number of positive tests in the sample of size $n$,

\begin{align}
  Y & \sim \text{Bi}(n, \pi) \\
  \eta & = \text{logit}(\pi) = \beta_0
  (\#eq:glm1)
\end{align}

where $\text{logit}(\pi) = \log(\pi / (1 - \pi))$. The quantity of interest is the _population_'s positivity rate $\pi \in (0, 1)$. However, the corresponding model parameter is $\beta_0 \in \mathbb{R}$ (the intercept of the linear predictor), which varies on the _logit_ scale.

```{r plot-logit}
#| fig.cap = "The logit transformation stretches the interval (0, 1) to the real line.",
#| fig.width = 3,
#| echo = FALSE
tibble(pi = seq(0.01, .99, by = 0.01)) |>
  mutate(b0 = qlogis(pi)) |>
  ggplot() +
  aes(pi, b0) +
  geom_path() +
  labs(x = expression(pi), y = expression(beta[0]))
```
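Since the linear predictor contains only the intercept, its maximum-likelihood estimate is simply the observed proportion of positives mapped onto the logit scale. As a quick side check (not part of the original analysis; recall that `qlogis()` is R's logit function):

```{r logit-check}
## Logit of the sample proportion (2/3): the GLM intercept estimate
## below should reproduce this value (about 0.69).
qlogis(2 / 3)
```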
Let's first verify that this model actually yields analogous estimates for the population positivity rate, using a frequentist procedure.

```{r fm-pcr-glm}
(
  fm_pcr_glm <- glm(
    y ~ 1,
    family = binomial,
    data = pcr_dat
  )
)
```

However, the estimate is on the _logit_ scale and we have to transform the results, including the confidence interval, back to the original scale. How exactly to do that is not important now (you can check the [source code](https://umr-astre.pages.mia.inra.fr/training/crashbayes/articles/crashbayes.R)). Actually, there are several alternative methods, leading to different results.

```{r fm-pcr-tidy}
#| echo = FALSE
fm_pcr_tidy <-
  broom::tidy(fm_pcr_glm, conf.int = TRUE) |>
  select(estimate, starts_with('conf')) |>
  mutate(across(.fns = plogis)) |>
  add_column(term = "$\\pi$", .before = 1)
```

```{r tab-fm-pcr-tidy}
#| echo = FALSE
kbl(
  fm_pcr_tidy,
  caption = paste(
    "Frequentist point and interval estimates of the population positivity rate",
    "in the PCR example."
  ),
  digits = 2
) |>
  kable_styling(
    full_width = FALSE
  )
```

While we obtained the same point estimate as with `prop.test()` (which is simply the sample proportion of positive tests), the lower end of the 95% Confidence Interval is somewhat higher. This is simply a consequence of the different procedures used for the computation of CIs in each case. Again, this is not very important at the moment. However, what I want to establish is that this GLM essentially reproduces our initial estimates of the population positivity rate.

## Prior elicitation

In order to fit the same model using a Bayesian approach, we want to __encode our prior knowledge into a probability density__ for the model parameter $\beta_0$. This does not mean that $\beta_0$ is _random_ in the sense that its value varies. However, we __don't know__ its actual value, and we are using a probability distribution to __represent the available information__ about its value.

We __know__ that the global population positivity rate is below 5%, and we would be very surprised if the true positivity rate in the village were larger than 50%. We can translate that knowledge into the following probability statements:

\begin{align}
  .50 & = P(\pi \leq .05) = P(\beta_0 \leq \text{logit}(.05)) \\
  .99 & = P(\pi \leq .50) = P(\beta_0 \leq \text{logit}(.50)),
\end{align}

since $\beta_0 = \text{logit}(\pi)$. In other words, $q_{.50} = \text{logit}(.05)$ and $q_{.99} = \text{logit}(.50)$ are the $.50$ and $.99$ quantiles of $\beta_0$'s distribution. Note that quantiles are preserved under monotonically increasing transformations.

If we consider a prior Gaussian distribution on $\beta_0$ as
\[\beta_0 \sim \mathcal{N}(\mu_0,\, \sigma_0), \]
we just have to choose the values of $\mu_0$ and $\sigma_0$ that match the probability statements above.

### Mathematical approach

Writing $\beta_0 = \mu_0 + \sigma_0 \phi$, with $\phi \sim \mathcal{N}(0, 1)$, we have $q_\kappa = \mu_0 + \sigma_0 \Phi^{-1}(\kappa)$, where $\Phi^{-1}$ is the quantile function of the standard normal distribution (a.k.a. the _probit_ function). Solving for $\mu_0$ and $\sigma_0$,

```{r prior-parameters}
mu0 <- qlogis(.05)  # In R, qlogis() gives the logit function. See help.
s0 <- (qlogis(.50) - qlogis(.05)) / qnorm(.99)
```

```{r priors-dat}
#| echo: FALSE
priors_dat <- data.frame(
  term = c("beta", "pi"),
  dist = c(
    dist_normal(mu0, s0),
    dist_transformed(dist_normal(mu0, s0), plogis, qlogis)
  )
)
```

\begin{align}
  \mu_0 & = q_{.50} = \text{logit}(.05) = `r round(mu0, 2)` \\
  \sigma_0 & = \Big(\text{logit}(.50) - \text{logit}(.05) \Big) / \Phi^{-1}(.99) = `r round(s0, 2)`.
\end{align}

```{r plot-priors}
#| echo = FALSE,
#| warning = FALSE,
#| fig.width = 6,
#| fig.cap = "Elicited prior distribution for the model parameter and the implied prior on the positivity rate."
plot_grid(
  ggplot(priors_dat) +
    stat_halfeye(
      data = ~ filter(., term == "beta"),
      aes(
        xdist = dist,
        fill = after_stat(cut(x, c(-Inf, mu0, qnorm(.99, mu0, s0), Inf)))
      ),
      show.legend = FALSE
    ) +
    scale_fill_manual(values = c('gray85', 'gray60', 'lightblue')) +
    labs(x = expression(beta), y = NULL) +
    theme(
      axis.text.y = element_blank(),
      axis.line.y = element_blank(),
      axis.ticks.y = element_blank(),
      panel.border = element_blank()
    )
  ,
  ggplot(priors_dat) +
    stat_halfeye(
      data = ~ filter(., term == "pi"),
      aes(
        xdist = dist,
        fill = after_stat(cut(x, plogis(c(-Inf, mu0, qnorm(.99, mu0, s0), Inf))))
      ),
      show.legend = FALSE
    ) +
    scale_fill_manual(values = c('gray85', 'gray60', 'lightblue')) +
    labs(x = expression(pi), y = NULL) +
    lims(x = c(0, 0.6)) +
    coord_cartesian(ylim = c(0, .3)) +
    theme(
      axis.text.y = element_blank(),
      axis.line.y = element_blank(),
      axis.ticks.y = element_blank(),
      panel.border = element_blank(),
      plot.margin = margin(0, 0, 10, 0)
    )
)
```

It is a good idea to visualize the implied prior on the _natural_ scale, for confirmation. Once we are satisfied with our probabilistic _representation_ of our previous knowledge, we can proceed with the estimation.

This method of prior elicitation requires some mathematical skills. Furthermore, it exploits some specific properties of the Gaussian distribution. In more general and complex scenarios this approach might become unfeasible or impractical.

### Computational approach

An alternative method is to take advantage of the computational capacity at our disposal and let the computer find the values for us. This approach is based on numerical optimisation and is completely general. On the other hand, it requires some programming skills in order to define custom functions.

```{r prior-elicitation-optimisation}
## Quantile function for pi, given the prior parameters
qpi <- function(mu0, s0, p0) {
  setNames(
    plogis(qnorm(p = p0, mean = mu0, sd = s0)),
    paste0("q_", p0)
  )
}
# Verify that the prior parameters computed before yield the desired quantiles
# qpi(-2.94, 1.27, p0 = c(.5, .99))

## Target function to be optimised with respect to the parameters.
## Simply the sum of squared differences of the quantile function
## with respect to the target quantiles.
target_pi <- function(par, p0, q0) {
  sum(
    (qpi(par[1], par[2], p0) - q0) ** 2
  )
}
# target_pi(par = c(-4.6, 1.98), p0 = c(.5, .99), q0 = c(.01, .5))

## Call to R's general optimiser, to find the parameters that minimise
## the value of the target function.
(par0 <- optim(
  par = c(0, 1),    # initial values of the parameters
  fn = target_pi,   # function to minimize with respect to the first argument
  p0 = c(.5, .99),  # target cumulative probabilities
  q0 = c(.05, .5)   # target corresponding quantiles
))

## Check:
## Verify that the parameters found yield the required quantiles for pi
qpi(par0$par[1], par0$par[2], p0 = c(.5, .99))
```

## Results

We can now proceed to fit the model, including the elicited prior, to the data (Fig. \@ref(fig:plot-pcr-dat)).
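As an aside (not part of the analysis that follows): had we placed a Beta prior directly on $\pi$, this intercept-only Binomial model would have an exact conjugate posterior, which provides a handy benchmark for MCMC output. For instance, a flat Beta(1, 1) prior combined with 2 positives out of 3 tests yields a Beta(1 + 2, 1 + 1) posterior.

```{r conjugate-check}
## Exact posterior quantiles for pi under a flat Beta(1, 1) prior
## (illustration only; the model below uses the Gaussian prior on the
## logit scale elicited above, which requires MCMC instead).
qbeta(c(.025, .5, .975), shape1 = 1 + 2, shape2 = 1 + 1)
```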
```{r fm-pcr}
#| results = 'hide'
pcr_prior_lowp <- normal(-2.94, 1.27)
fm1_pcr <- rstanarm::stan_glm(
  y ~ 1,
  family = binomial(),
  data = pcr_dat,
  prior_intercept = pcr_prior_lowp
)
```

```{r fm-pcr-estimates}
#| echo: FALSE
fm_pcr_estimates <-
  ## Bayesian estimates (posterior median and 95% quantile interval)
  mcmc_intervals_data(
    fm1_pcr,
    transformations = plogis,
    point_est = "median",
    prob_outer = .95
  ) |>
  select(parameter, point_est = m, ll, hh) |>
  add_column(method = "Bayes") |>
  ## Frequentist estimates
  bind_rows(
    fm_pcr_tidy |>  # broom::tidy(fm_pcr_glm, conf.int = TRUE) |>
      select(
        parameter = term,
        point_est = estimate,
        ll = conf.low,
        hh = conf.high
      ) |>
      add_column(method = "Freq")
  ) |>
  mutate(parameter = "pi")
```

The point and interval estimates of the positivity rate $\pi$ from the frequentist and Bayesian perspectives are displayed in Figure \@ref(fig:plot-estimates-pcr).

```{r plot-estimates-pcr}
#| warning = FALSE,
#| fig.cap = cap,
#| echo = FALSE
cap <- paste(
  "**Frequentist** and",
  "**Bayesian**",
  "point and interval estimates for the PCR example.",
  "The global positivity rate of 5% was added as a vertical line for reference."
)
ggplot(fm_pcr_estimates) +
  aes(point_est, parameter, colour = method) +
  geom_vline(xintercept = 0.05, colour = "gray90") +
  geom_pointrange(
    aes(xmin = ll, xmax = hh),
    position = position_dodge(width = 0.3),
    show.legend = FALSE
  ) +
  labs(
    x = "positivity rate",
    y = NULL
  ) +
  scale_x_continuous(breaks = seq(0, 1, by = 0.2)) +
  scale_colour_manual(
    values = c("darkorange", "darkgrey")
  ) +
  theme(
    axis.text.y = element_blank()
  )
```

## Exercises

1. Match the priors

   Suppose that the government stopped monitoring SARS-CoV-2 (hypothetically), so you don't have a reliable picture of the global situation. You consider 3 potential prior distributions for the __prevalence__:

   1. __Optimistic__: you are confident that the prevalence must be under control
   2. __Clueless__: you are unsure; anything is possible
   3. __Conspiracist__: despite the seeming normality, you believe that the situation is dire

   Can you match the following plots with the priors?

```{r match-priors}
#| echo: FALSE
tibble(
  candidate = LETTERS[1:3],
  dist = dist_beta(c(1, 6, 12), c(1, 9, 1))
) |>
  add_column(
    x = list(seq(0.01, .99, by = .01))
  ) |>
  rowwise() |>
  mutate(
    y = list(density(dist, at = x))
  ) |>
  unnest(cols = c('x', 'y')) |>
  ggplot() +
  aes(x, y) +
  geom_area(
    alpha = 0.3,
    fill = 'steelblue'
  ) +
  geom_line(colour = 'steelblue') +
  facet_wrap(~candidate) +
  labs(
    x = "Prevalence",
    y = NULL
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.line.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
  )
```

2. High-prevalence context

   Suppose that you observed the __same data__, but in a context where the global positivity rate is about 50%, with a lot of variation across the territory (i.e. cases are concentrated in some specific regions). Change your prior accordingly and refit the model. How did the conclusions change?

```{r solution-exercise-2}
#| echo: FALSE
#| include: FALSE
## Prior elicitation
## Let's set the "centre" (q50) of the prior at 0.5 and 70% of cumulated
## probability by 0.9.
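## (Added sketch) The closed-form solution, mirroring the mathematical
## approach above: mu0 = logit(.5) = 0 and
## s0 = (qlogis(.9) - qlogis(.5)) / qnorm(.7), about 4.19, which the
## numerical optimisation below should approximately recover.
mu0_hp <- qlogis(.5)
s0_hp <- (qlogis(.9) - qlogis(.5)) / qnorm(.7)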
(par0 <- optim(
  par = c(0, 1),    # initial values of the parameters
  fn = target_pi,   # function to minimize with respect to the first argument
  p0 = c(.5, .7),   # target cumulative probabilities
  q0 = c(.5, .9)    # target corresponding quantiles
))

## Check:
## Verify that the parameters found yield the required quantiles for pi
qpi(par0$par[1], par0$par[2], p0 = c(.5, .7))

## Define the prior for future reference
pcr_prior_highp <- normal(0, 4)

## Visualise the implied prior on pi:
## highlight the cumulative 0.7 probability up to pi = 0.9
ggplot() +
  stat_halfeye(
    aes(
      xdist = dist_transformed(dist_normal(0, 4), plogis, qlogis),
      fill = after_stat(x < .9)
    ),
    show.legend = FALSE
  ) +
  scale_fill_manual(values = c("gray85", "skyblue"))

## Fit model
fm2_pcr <- rstanarm::stan_glm(
  y ~ 1,
  family = binomial(),
  data = pcr_dat,
  prior_intercept = pcr_prior_highp
)

## Proportion of values below 0.5
round(mean(plogis(as.matrix(fm2_pcr)) < .5), 2)

## Extract results from the same model under both priors (scenarios)
ex1_dat <- bind_rows(
  mcmc_intervals_data(fm1_pcr, transformations = 'plogis') |>
    add_column(scenario = "low\nprevalence"),
  mcmc_intervals_data(fm2_pcr, transformations = 'plogis') |>
    add_column(scenario = "high\nprevalence")
)
```

```{r plot-exercise-2, fig.cap = cap}
#| echo: FALSE
cap <- "Estimated positivity rates depend on the context."
ggplot(ex1_dat) +
  aes(m, scenario) +
  geom_pointinterval(
    aes(xmin = ll, xmax = hh)
  ) +
  geom_pointinterval(
    aes(xmin = l, xmax = h),
    size = 5
  ) +
  scale_x_continuous(limits = c(0, 1)) +
  labs(
    x = "Positivity rate",
    y = "General scenario"
  )
```

3. Larger sample size

   Suppose that the next day you tested 6 additional people, with the same proportion of positives (i.e. 4/6 positives). Aggregating the data from both days, obtain and compare the corresponding estimates of the positivity rate in the village under both scenarios.

```{r solution-exercise-3}
#| echo: FALSE
#| include: FALSE
pcr_dat2 <- rbind(pcr_dat, pcr_dat, pcr_dat)
fm3_pcr <- rstanarm::stan_glm(
  y ~ 1,
  family = binomial(),
  data = pcr_dat2,
  prior_intercept = pcr_prior_lowp
)
fm4_pcr <- rstanarm::stan_glm(
  y ~ 1,
  family = binomial(),
  data = pcr_dat2,
  prior_intercept = pcr_prior_highp
)

## Extract results from the same model under both priors (scenarios) with
## the additional data, and append to the previous results.
ex2_dat <- ex1_dat |>
  add_column(n_obs = 3) |>
  bind_rows(
    mcmc_intervals_data(fm3_pcr, transformations = 'plogis') |>
      add_column(scenario = "low\nprevalence", n_obs = 9),
    mcmc_intervals_data(fm4_pcr, transformations = 'plogis') |>
      add_column(scenario = "high\nprevalence", n_obs = 9)
  )
```

```{r plot-exercise-3, fig.cap = cap}
#| echo: FALSE
cap <- paste(
  "As more data are observed, the impact of the prior progressively",
  "fades away and estimates become more precise.",
  "Results from the previous sparse-data scenario are shown in light grey for reference."
)
ggplot(ex2_dat) +
  aes(m, scenario, colour = factor(n_obs)) +
  geom_pointinterval(
    aes(xmin = ll, xmax = hh),
    position = position_dodge(width = 0.5),
    show.legend = FALSE
  ) +
  geom_pointinterval(
    aes(xmin = l, xmax = h),
    size = 5,
    position = position_dodge(width = 0.5),
    show.legend = FALSE
  ) +
  labs(
    x = "positivity rate",
    y = "General scenario"
  ) +
  scale_x_continuous(limits = c(0, 1)) +
  scale_colour_manual(
    values = c('grey80', 'grey30')
  )
```

## In summary

- The same data is (and should be) interpreted differently depending on the context [^3].
[^3]: This is part of the controversy between the frequentist and Bayesian schools (science was all about _objectivity_, right?), and is partly rooted in epistemological stances. I won't wade into that debate today, but let me point you to [@siegfriedOddsAreIt2010] for an overview of the topic.

- This can be formalised by using prior distributions over the model parameters in a Bayesian framework.
- Bayesian statistics uses __probability__ to represent __information__ about things (e.g. model parameters).
- Priors are particularly important when observations are scarce, extreme, unexpected or uninformative. They work as a [safety device](http://elevanth.org/blog/2017/08/22/there-is-always-prior-information/).
- Conversely, the impact of the prior vanishes as the evidence accumulates. Regardless of prior beliefs, everyone reaches the same conclusions after observing sufficient data.

# Posteriors: much more than point-and-interval estimation

In the same way as the prior distribution represents the information about a parameter __before__ observing the data, the Bayesian model fit yields full __posterior__ probability distributions for every parameter, representing the knowledge about the parameters __after__ observing the data.

Rather than a mathematical expression for the posterior distribution, `rstanarm` yields a __random sample__ from it, which can be summarized in different ways: posterior mean, median, mode, quantiles, credible intervals... or a summary plot of the posterior densities, as shown in Figure \@ref(fig:plot-posteriors). The function [`bayesplot::mcmc_areas()`](https://mc-stan.org/bayesplot/reference/MCMC-intervals.html) plots the posterior distributions of the parameters or any transformation of them.

```{r plot-posteriors}
#| fig.width = 6,
#| fig.cap = paste(
#|   "Posterior densities of the positivity rate in the logistic and natural scales.",
#|   "The vertical lines and bands represent the medians and the 50% central intervals respectively."
#| )
plot_grid(
  mcmc_areas(fm1_pcr),  # posterior for the (only) model parameter beta_0
  mcmc_areas(fm1_pcr, transformations = 'plogis')  # posterior for pi
)
```

This is __much more informative__ than simple point and interval estimates. Unlike frequentist Confidence Intervals, you can formulate all kinds of _probabilistic statements_ using the posterior distributions of the model parameters or their transformations. For instance, what is the __probability__ that the positivity rate $\pi$ (a transformation of the model parameter $\beta_0$) exceeds 0.5, or is below 0.1...?

```{r post-pi-below-50}
post_samples <- as.matrix(fm1_pcr)      # 4000 x 1 matrix with values from
                                        # the posterior distribution of beta_0
post_pi <- plogis(post_samples)         # posterior sample for pi
post_pi_below_50 <- mean(post_pi < .5)  # Proportion of values below 0.5
```

Despite your strong prior beliefs ($P(\pi < 0.50) = 0.99$), after looking at 3 observations, the __posterior probability__ that the positivity rate in the village is below 0.5 dropped to $P(\pi < 0.50 \,|\, \text{Data}) = `r round(post_pi_below_50, 2)`$.

```{r plot-prior-vs-posterior}
#| echo = FALSE,
#| warning = FALSE,
#| fig.width = 6,
#| fig.cap = cap
cap <- paste(
  "The __posterior__ distribution",
  "is a compromise between your",
  "__prior__",
  "knowledge and the evidence from the",
  "__observed data__",
  "(scaled likelihood)."
)
lik_pcr <- function(x, dat) {
  dbinom(sum(dat$y), size = nrow(dat), prob = x)
}
lik_sum <- integrate(lik_pcr, lower = 0, upper = 1, dat = pcr_dat)
mcmc_areas_data(
  fm1_pcr,
  transformations = 'plogis'
) |>
  filter(interval == "outer") |>
  select(
    x, y_post = density
  ) |>
  mutate(
    y_prior = unlist(distributional:::density.distribution(
      priors_dat |> filter(term == 'pi') |> pull(dist),
      at = x
    )),
    y_like = lik_pcr(x, dat = pcr_dat),
    y_like = y_like / lik_sum$value
  ) |>
  pivot_longer(
    cols = starts_with("y_"),
    names_prefix = "y_",
    names_to = "dist",
    values_to = "y"
  ) |>
  ggplot() +
  aes(x, y) +
  geom_line(aes(colour = dist), show.legend = FALSE) +
  geom_area(
    data = ~ filter(., dist == 'prior') |> mutate(y = replace(y, x > 0.5, NA)),
    alpha = 0.3,
    fill = 'steelblue'
  ) +
  geom_area(
    data = ~ filter(., dist == 'post') |> mutate(y = replace(y, x > 0.5, NA)),
    alpha = 0.3,
    fill = 'darkgreen'
  ) +
  geom_point(
    data = data.frame(
      x = mean(pcr_dat$y),
      y = lik_pcr(mean(pcr_dat$y), dat = pcr_dat) / lik_sum$value
    ),
    colour = 'darkorange'
  ) +
  coord_cartesian(ylim = c(0, 3)) +
  scale_colour_manual(
    values = c(prior = "steelblue", post = "darkgreen", like = "darkorange")
  ) +
  labs(
    x = "positivity rate",
    y = NULL
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.line.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
  )
```

Check out the excellent [interactive visualisation](https://rpsychologist.com/d3/bayes/) by Kristoffer Magnusson, which illustrates the interplay between these three components of Bayesian inference.

## Exercises

1. Using the previous [interactive visualisation tool](https://rpsychologist.com/d3/bayes/), determine in which situations the posterior distribution closely matches the likelihood, regardless of the effect size.

2. Compute the posterior probabilities that the positivity rate in the village is below 0.5 under the high-prevalence and the larger-sample-size scenarios from the previous section.

```{r solutions-exercise-4}
#| echo: FALSE
cap <- paste(
  "Posterior probability that the positivity rate is below 50%",
  "depending on the scenario (global context and sample size),",
  "for the same proportion (2/3) of observed positives."
)
## Define a few helper functions that act on fitted models

## Posterior probability that the positivity rate is under 50%
post_prev_under50 <- function(x) {
  mean(plogis(as.matrix(x)) < .5)
}

## Prior used: low- or high-prevalence context.
## You can examine the priors used in a model fit with the `prior_summary()`
## function.
prior_context <- function(x) {
  ifelse(
    prior_summary(x)$prior_intercept$location < 0,
    "Low-prevalence",
    "High-prevalence"
  )
}

## Arrange results in a summary table
ex4_dat <- tibble(
  model = list(fm1_pcr, fm2_pcr, fm3_pcr, fm4_pcr)
) |>
  mutate(
    context = sapply(model, prior_context),
    sample_size = sapply(model, \(.) nrow(.$data)),
    prob = sapply(model, post_prev_under50)
  ) |>
  select(-model) |>
  pivot_wider(
    names_from = 'context',
    values_from = 'prob'
  )
```

Table \@ref(tab:table-solutions-exercise-4) displays the results I obtained. Discuss and explain the differences between the posterior probabilities across scenarios.
```{r table-solutions-exercise-4}
#| echo: FALSE
## Remember, prior to observing the data:
## Low-prevalence scenario: P(\pi < .5) = .99
## High-prevalence scenario: P(\pi < .5) = .50
kbl(
  ex4_dat,
  caption = cap,
  digits = 2,
  col.names = c("Sample size", "Low-prevalence", "High-prevalence")
) |>
  add_header_above(c(" ", "Context" = 2))
```

## In summary

- The resulting inference about the model parameters (or any transformation of them) is encoded in __posterior probability density functions__ (or, rather, a random sample from them), which can be summarized in different ways for communication purposes.
- Credible Intervals can be interpreted in __probabilistic__ terms.
- Posterior distributions are a trade-off between the prior information and the evidence from the observed data. Strong priors and sparse data result in little learning, whereas weaker priors have negligible impact in the presence of abundant data.

# Comparison of group means

Comparing the average expected outcomes of different groups is a typical goal in experimental settings (e.g. control and treatment groups), for which several frequentist approaches are commonly used (_t_-tests, ANOVA, ANCOVA...). Essentially, all of these [reduce to a linear model](https://lindeloev.github.io/tests-as-linear/) with one or more categorical factors.

```{r plot-best}
#| echo = FALSE,
#| fig.height = 4,
#| fig.cap = cap
cap <- paste(
  "Simulated scores from two randomized groups of people:",
  "'drug' ($N_1 = 47$) and 'placebo' ($N_2 = 42$) who take an IQ test.",
  "The sample averages are represented by an orange point at the bottom."
)
best_summary <- best |>
  group_by(group) |>
  summarise(y = mean(y)) |>
  mutate(
    lab = paste("Avg.:", round(y, 2))
  )
best |>
  ggplot() +
  aes(y) +
  geom_dotplot(binwidth = 1) +
  geom_point(
    data = best_summary,
    y = 0,
    colour = 'darkorange',
    size = 4
  ) +
  geom_text(
    data = best_summary,
    aes(label = lab),
    x = 110,
    y = .2,
    hjust = 0,
    # vjust = 0,
    colour = 'grey50'
  ) +
  geom_curve(
    data = best_summary,
    x = 110,
    y = .2,
    aes(xend = y + .3, yend = 0 + .03),
    colour = 'grey72',
    arrow = arrow(length = unit(5, 'pt')),
    curvature = .2
  ) +
  facet_grid(group ~ .) +
  labs(
    x = "IQ",
    y = NULL
  ) +
  coord_cartesian(clip = 'off') +
  theme(
    axis.text.y = element_blank(),
    axis.line.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
  )
```

The data represented in Figure \@ref(fig:plot-best) are extracted from @kruschkeBayesianEstimationSupersedes2013 and are included in `crashbayes` as [`best`](https://umr-astre.pages.mia.inra.fr/training/crashbayes/reference/best.html). The average IQ score in the _drug_ group is slightly larger than that of the _placebo_ group. But the difference is relatively small with respect to the individual variation within groups.

The question is classically framed in terms of whether the drug has any _significant_ effect [^4]. This is a bit confusing, because it does not tell much about what we typically want to know [@greenlandStatisticalTestsValues2016; @gelmanValuesStatisticalPractice2013a].

[^4]: That is, whether the observed difference is larger than what would be expected most times (e.g. 95%) in replications of the experiment __if the effect was null__.

Instead, we are going to assess the relative __credibility__ of the possible values of the effect with a posterior probability distribution.

Let's start by assuming a tentative model for the data. A linear model (which is the model behind the $t$-test) is a good starting point.
\begin{align}
  Y_i & \sim \mathcal{N}(\mu_i, \sigma) \\
  \mu_i & = \beta_0 + \beta_\text{d} \mathbb{1}_\text{d}[i]
  (\#eq:lm2)
\end{align}

where $i$ is the group (drug or placebo), and $\mathbb{1}_\text{d}[i]$ evaluates to 1 when the observation belongs to the _drug_ group and to 0 otherwise. The effect of the drug is $\beta_\text{d}$. But the model also involves the parameter $\beta_0$ (the expected value of the placebo group) and $\sigma$, the individual standard deviation of values within both groups.

## Using default priors

The authors of `rstanarm` [@goodrichRstanarmBayesianApplied2020] have put a lot of thought and effort into coming up with reasonable [default priors](https://mc-stan.org/rstanarm/articles/priors.html#default-weakly-informative-prior-distributions-1). Let's start from there and adjust later.

```{r fm0-best}
#| results = 'hide'
fm0_best <- rstanarm::stan_glm(
  y ~ 1 + group,
  family = gaussian(),  # Default, but making it explicit.
  data = best
)
```

The function `summary()` can be used on a fitted model in order to obtain summary and diagnostic statistics. We can directly obtain point and interval numerical estimates for the target effect `groupdrug`.

```{r fm0-summary}
summary(fm0_best)
```

We can specify optional arguments to the `plot()` function in order to focus on the target parameter $\beta_\text{d}$ and to display the full posterior probability density, rather than the default point-interval summary, in Figure \@ref(fig:fm0-best-plot).

```{r fm0-best-plot}
#| fig.cap = "Posterior density function for the target parameter."
plot(fm0_best, plotfun = 'areas', pars = 'groupdrug')
```

Finally, we can extract a sample from the posterior distribution and compute, for instance, the probability that the effect is positive by computing the corresponding proportion.

```{r fm0-best-prob}
(post_drug_gt0 <- mean(as.matrix(fm0_best)[, "groupdrug"] > 0))
```

Note that this effect would be considered _not significant_ under a classical approach:

```{r fm0-best-freq}
fm0_best_freq <- lm(y ~ group, data = best)
summary(fm0_best_freq)
confint(fm0_best_freq)
```

We can check which priors were used in a given model using the function [`prior_summary()`](https://mc-stan.org/rstanarm/reference/prior_summary.stanreg.html):

```{r fm0-best-prior-summary}
prior_summary(fm0_best)
```

Here we can see:

- A different prior for each of the three parameters in the model
- Each of these default priors has been _adjusted_ to match the scale of the observed data [^5]
- The prior for the intercept is _centred_ at the average response in the dataset
- The prior for the intercept is applied to an (internally) re-parametrised model, where all the predictors are centred. This is very useful for continuous covariates, because the intercept then represents the expected outcome at the average value of $x$ rather than at $x = 0$. But it is less useful in this case, where the expected outcome of the placebo group is more meaningful than the expected average outcome, which depends essentially on the number of observations that we happened to have in each group. As explained in `rstanarm`'s [documentation on prior distributions](https://mc-stan.org/rstanarm/articles/priors.html#default-weakly-informative-prior-distributions-1), we can disable this by manually creating an intercept variable or by using an option for sparse computation (see below).

[^5]: This is technically not fully Bayesian, as the prior should not depend on the observed data. But it is a necessary compromise for providing sensible default priors.
And the offence is very minor, as only the total variability (SD) of the data is used.

## Specifying custom priors

A little [domain knowledge](https://en.wikipedia.org/wiki/IQ_classification) tells us that IQ tests are designed to have a median of 100 and a standard deviation of 15. This is very valuable information for specifying priors for $\beta_0$ and $\sigma$. For instance, we could propose:

\[
\begin{aligned}
  \beta_0 & \sim \mathcal{N}(100, 5) \\
  \beta_\text{d} & \sim \mathcal{N}(0, s_0) \\
  \sigma & \sim \text{Exp}(1/15)
\end{aligned}
\]

Regarding the target parameter $\beta_\text{d}$, we don't expect the drug to turn an _average_ IQ (90-109) into a _very superior_ one (130+). Thus, we can assume that the effect of the drug is almost surely below 30.

```{r sigma-0}
## Find the Gaussian SD that yields a cumulative probability of
## .999 at x = 30
cump <- function(x) pnorm(30, mean = 0, sd = x)
target <- function(x) (cump(x) - .999)**2
(s0 <- optimise(target, interval = c(0, 20))$minimum)
```

```{r fm1-best-priors}
best_prior_b0 <- normal(100, 5)
best_prior_sigma <- exponential(1/15)
best_prior_bdrug <- normal(0, s0)
```

```{r best-prior-comparison}
#| fig.cap = cap,
#| fig.width = 8,
#| echo = FALSE
cap <- paste(
  "**Default** and",
  "**custom**",
  "priors for the model parameters."
)
tibble(
  parameter = fct_inorder(c("Intercept", "Coefficient", "Sigma")),
  x = list(seq(50, 150, by = 1), seq(-60, 60, by = 1), seq(0.01, 45, by = 0.2))
) |>
  expand_grid(
    prior = fct_inorder(c("Default", "Custom"))
  ) |>
  add_column(
    dist = c(
      c(
        ## Default                Custom
        dist_normal(101, 12),     dist_normal(100, 5),    ## Intercept
        dist_normal(0, 24),       dist_normal(0, s0),     ## Coefficient
        dist_exponential(0.21),   dist_exponential(1/15)  ## Sigma
      )
    )
  ) |>
  rowwise() |>
  mutate(
    y = list(density(dist, at = x))
  ) |>
  unnest(cols = c('x', 'y')) |>
  ggplot() +
  aes(x, y) +
  geom_area(
    alpha = 0.3,
    aes(fill = prior),
    position = 'identity',
    show.legend = FALSE
  ) +
  geom_line(aes(colour = prior), show.legend = FALSE) +
  facet_wrap(~ parameter, scales = 'free') +
  labs(
    x = NULL,
    y = NULL
  ) +
  scale_colour_manual(values = c(Default = "lightgrey", Custom = "steelblue")) +
  scale_fill_manual(values = c(Default = "lightgrey", Custom = "steelblue")) +
  theme(
    axis.text.y = element_blank(),
    axis.line.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
  )
```

Our custom priors are rather consistent with the default ones, albeit somewhat more informative, with the exception of $\sigma$. __Any thoughts about that?__ We will come back to $\sigma$ later. Let's first fit the model to the data using our carefully crafted priors.

```{r fm1-best}
fm1_best <- rstanarm::stan_glm(
  y ~ 1 + group,
  family = gaussian(),
  data = best,
  prior_intercept = best_prior_b0,
  prior = best_prior_bdrug,
  prior_aux = best_prior_sigma,
  sparse = TRUE  ## Disables "centering" of predictors
)
```

```{r best-posterior-comparison}
#| fig.cap = cap,
#| echo = FALSE
cap <- paste(
  "Posterior density estimates of the drug effect using",
  "**default** and",
  "**custom**",
  "priors."
)
tibble(
  priors = fct_inorder(c("Default", "Custom")),
  results = list(
    mcmc_areas_data(fm0_best, pars = 'groupdrug'),
    mcmc_areas_data(fm1_best, pars = 'groupdrug')
  )
) |>
  unnest(results) |>
  ggplot() +
  aes(x, density) +
  geom_area(
    aes(fill = priors),
    alpha = 0.3,
    position = 'identity',
    show.legend = FALSE
  ) +
  scale_fill_manual(values = c(Default = "lightgrey", Custom = "steelblue")) +
  labs(
    x = expression(beta[drug]),
    y = NULL
  ) +
  theme(
    axis.text.y = element_blank(),
    axis.line.y = element_blank(),
    axis.ticks.y = element_blank(),
    panel.border = element_blank()
  )
```

As you can see, despite the relatively important differences in the prior specifications, the results are essentially identical. __Why do you think this is the case?__

## In summary

- Bayesian estimation yields much richer information about group differences (or any model parameter) than the classical counterparts. There is literally no reason to continue performing $t$-tests (or ANOVA) nowadays [@kruschkeBayesianEstimationSupersedes2013].
- But we are not done yet. Before trusting these results we need to perform some checks and validations.

# Model assessment

## MCMC diagnostics

The last section of the `summary()` output reports some basic diagnostic values for each parameter in the model.

```{r fm1-best-summary}
summary(fm1_best)
```

The text at the end of the summary provides some very succinct information for their interpretation. Essentially,

- `mcse`, the Monte Carlo standard error, is the error in the estimates due to computing them from a finite number of simulation draws. Obviously, the smaller the better. But any magnitude under the practical resolution of interest will suffice.
- `Rhat` is a technical measure of the convergence of the chains. It essentially evaluates whether estimates computed from the different chains are similar (which is why 4 chains are computed independently). It should be close to 1 (ideally within $\pm$ 0.01, but $\pm$ 0.1 is often sufficient).
- `n_eff` is a measure of the effective sample size (ESS). It can be interpreted as the amount of information in the simulation in terms of an equivalent number of independent draws. The higher the better.

See the documentation on [Runtime warnings and convergence problems](https://mc-stan.org/misc/warnings.html) for more details.

## Posterior predictive checks

The results of a statistical analysis are conditional on the model, i.e., they are only meaningful as long as the model is _true_. Or, more pragmatically, as long as the model is a reasonable approximation to the real _data-generating process_. This assessment is classically performed by examining various diagnostic plots, such as histograms of residuals or scatter plots of predicted values and observed data.

In a Bayesian setting, we get full posterior __predictive__ distributions for the outcomes, which __combine__ the uncertainties in the model parameters with the variability of the likelihood [^6].

[^6]: In contrast, frequentist procedures typically yield point and interval estimates from maximum-likelihood parameter values, neglecting the uncertainty in their estimation.

We can use these posterior predictive distributions to simulate new outcomes and compare them to the actual observed dataset in various ways [@gabryVisualizationBayesianWorkflow2018]. We can use `rstanarm::pp_check()` with various options for examining the posterior predictive distributions from different angles.
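These checks are built on simulations from the posterior predictive distribution, which we can also draw manually with `rstanarm::posterior_predict()`. A minimal sketch, using the `fm1_best` fit from above:

```{r manual-pp-draws}
## Draw 8 replicated outcome vectors from the posterior predictive
## distribution: one row per draw, one column per observation in `best`.
yrep <- rstanarm::posterior_predict(fm1_best, draws = 8)
dim(yrep)  # 8 x nrow(best)
```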
```{r fm1-best-ppc-dens}
#| fig.width = 8,
#| fig.cap = paste(
#|   "Empirical density of the observed outcomes (y)",
#|   "and the corresponding densities from 8 replicates",
#|   "from the fitted model."
#| )
pp_check(fm1_best, plotfun = 'dens')
```

It looks like the overall _shape_ of the replicated outcomes is rather different from that of the actual values. However, 8 replications are too few to get a reliable sense of the variation. Let's overlay a few more, while splitting by group at the same time.

```{r fm1-best-ppc-dens-overlay-grouped}
#| fig.width = 8,
#| fig.cap = paste(
#|   "Empirical density of the observed outcomes (y)",
#|   "split by group",
#|   "and the corresponding densities from 100 replicates",
#|   "from the fitted model."
#| )
pp_check(fm1_best, plotfun = 'dens_overlay_grouped', group = 'group')
```

We can also compare the distribution of some __summary statistic__ of the __replicated__ data with the __realised__ statistic for the observed data. This helps us to challenge our current model and identify in which ways it fails to accurately reproduce the patterns in the data.

The default statistic is the sample average (Fig. \@ref(fig:fm1-best-ppc-stat-mean)). The posterior predictive distribution of this statistic is summarised as `mean_PPD` in the output of `summary()`.

```{r fm1-best-ppc-stat-mean}
#| fig.cap = "Posterior predictive check for the sample average.",
#| message = FALSE,
#| warning = FALSE
pp_check(fm1_best, plotfun = 'stat')
```

As expected, simulating new data from the fitted model yields _virtual_ samples with averages that vary around the observed average.

From the previous plots, and the observation that the default prior for $\sigma$ (based on the scale of the data) was narrower than our prior expectations, we may be interested in looking at the [_kurtosis_](https://en.wikipedia.org/wiki/Kurtosis) and the standard deviation of the data.

```{r fm1-best-ppc-stat-kurtosis}
#| fig.width = 8,
#| message = FALSE
## Compute the (excess) kurtosis of a sample x
kurtosis <- function(x) {
  m4 <- mean((x - mean(x)) ^ 4)
  kurtosis <- m4 / (sd(x) ^ 4) - 3
  kurtosis
}
pp_check(
  fm1_best,
  plotfun = 'stat_grouped',
  stat = kurtosis,
  group = 'group'
)
```

```{r fm1-best-ppc-stat-sd}
#| fig.width = 8,
#| message = FALSE
pp_check(
  fm1_best,
  plotfun = 'stat_grouped',
  stat = sd,
  group = 'group'
)
```

The model seems to be failing to reproduce the outliers in the data, and to be somewhat compensating with an increased SD, especially in the placebo group. Besides, the dispersions within groups should probably be allowed to differ.

## In summary

- Using posterior predictive checks, we identified 2 issues in the model specification: the dispersion of the outcomes should probably be different in each group, and the Gaussian likelihood is not appropriate, due to more extreme observations than would be expected from a Gaussian distribution.

- Addressing these issues is straightforward in principle. We would simply replace the Gaussian likelihood with a Student-$t$ and use a different parameter for the standard deviation in each group. However, such a _Student_ likelihood is not built into `rstanarm`. We could instead specify this model in [`brms`](https://paul-buerkner.github.io/brms/) [@burknerBrmsPackageBayesian2017] or in [Stan](https://mc-stan.org/) directly, which are both more flexible and powerful, but require compilation of the model and a different syntax. Not particularly difficult, but it's an additional layer that I didn't want to introduce today.
I have left the code in the [source file](https://umr-astre.pages.mia.inra.fr/training/crashbayes/articles/crashbayes.R) if you are curious.

```{r fm2-best}
#| eval = FALSE,
#| echo = FALSE
## This is how the corrected model looks in brms
fm2_best <- brms::brm(
  brms::bf(y ~ 1 + group, sigma ~ 1 + group),
  family = brms::student(),
  data = best,
  prior = c(
    brms::set_prior('normal(100, 5)', class = 'Intercept'),
    brms::set_prior('normal(0, 14.56)', class = 'b'),
    brms::set_prior('normal(0, 1)', class = 'b', dpar = 'sigma'),
    brms::set_prior('normal(log(15), 1)', class = 'Intercept', dpar = 'sigma')
  )
)
```

```{r fm2-best-ppc}
#| eval = FALSE,
#| echo = FALSE
ppc_dens(
  best$y,
  posterior_predict(fm2_best, ndraws = 8)
)
fm2_best_pred <- posterior_predict(fm2_best, ndraws = 1e2)
ppc_dens_overlay_grouped(best$y, fm2_best_pred, group = best$group) +
  coord_cartesian(xlim = c(60, 140))
ppc_stat(best$y, fm2_best_pred, stat = e1071::kurtosis, binwidth = .5)
ppc_stat_grouped(best$y, fm2_best_pred, stat = sd, group = best$group, binwidth = .3) +
  coord_cartesian(xlim = c(0, 30))
mcmc_areas(fm2_best, pars = 'b_groupdrug')
```

# Conclusions

- Given sufficiently informative data, the influence of the prior is negligible, and both frequentist and Bayesian approaches give similar numerical point and interval estimates.

- Using priors can be crucial for: combining prior information with new evidence to progressively update scientific knowledge on solid probabilistic foundations; obtaining valid results even when only sparse data are available; and avoiding numerical issues of multi-collinearity or divergences.

- Bayesian results have natural and straightforward interpretations in terms of probabilities, which represent the knowledge and the uncertainties about the quantities of interest.

- Using `rstanarm::stan_glm()` and `rstanarm::stan_glmer()` you can easily and effectively fit Bayesian LMs, GLMs and GLMMs. This should cover most of your regression modelling needs, with the additional benefits of using priors as a safeguard against over-fitting and posteriors for a straightforward interpretation of results and for model evaluation.

- With a little more experience and training you can unlock `brms` and `rstan` for additional modelling flexibility. They require some more technical knowledge; however, the fundamental concepts and workflow for Bayesian analysis remain the same.

# Bibliography