Simulation study of a model
Function report_sim() automates the process of doing a simulation study. We are still experimenting with report_sim(), and the interface may change. Suggestions are welcome, ideally through raising an issue here.
The most straightforward type of simulation is when the 'estimation model' used to do the inference matches the 'data generation model' used to create the data. Even when the estimation model matches the data generation model, the inferred values for the parameters will not exactly reproduce the actual values, since the data is drawn at random and provides only a noisy signal about the parameters it was generated from. However, if the experiment is repeated many times, with a different randomly-drawn dataset each time, the errors should more or less average out at zero, 50% credible intervals should contain the true values close to 50% of the time, and 95% credible intervals should contain the true values close to 95% of the time.
To illustrate, we investigate the performance of a model of divorce rates in New Zealand. We reduce the number of ages and time periods to speed up the calculations.
library(bage)
library(dplyr, warn.conflicts = FALSE)
library(poputils)

divorces_small <- nzl_divorces |>
  filter(age_upper(age) < 40,
         time >= 2016) |>
  droplevels()

mod <- mod_pois(divorces ~ age * sex + time,
                data = divorces_small,
                exposure = population)
mod
#>
#> ------ Unfitted Poisson model ------
#>
#> divorces ~ age * sex + time
#>
#> exposure = population
#>
#>        term  prior along n_par n_par_free
#> (Intercept) NFix()     -     1          1
#>         age   RW()   age     4          4
#>         sex NFix()     -     2          2
#>        time   RW()  time     6          6
#>     age:sex   RW()   age     8          8
#>
#> disp: mean = 1
#>
#> n_draw var_time var_age var_sexgender
#>   1000     time     age           sex
To do the simulation study, we pass the model to report_sim(). If only one model is supplied, report_sim() assumes that the same model should be used as both the estimation model and the data generation model. By default, report_sim() repeats the experiment 100 times, generating a different dataset each time.
set.seed(0)
res <- report_sim(mod_est = mod)
res
#> $components
#> # A tibble: 9 × 7
#>   term        component  .error .cover_50 .cover_95 .length_50 .length_95
#>   <chr>       <chr>       <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
#> 1 (Intercept) effect     0.0189     0.54      0.94       1.18       3.41
#> 2 age         effect    -0.132      0.5       0.962      1.42       4.11
#> 3 age         hyper      0.0570     0.37      0.83       0.571      2.00
#> 4 sex         effect     0.0154     0.49      0.955      1.13       3.28
#> 5 time        effect    -0.0289     0.485     0.963      1.26       3.67
#> 6 time        hyper      0.0319     0.54      0.92       0.410      1.34
#> 7 age:sex     effect     0.0149     0.534     0.938      1.38       4.00
#> 8 age:sex     hyper      0.0237     0.45      0.83       0.454      1.62
#> 9 disp        disp      -0.0295     0.53      0.91       0.257      0.762
#>
#> $augment
#> # A tibble: 2 × 7
#>   .var      .observed    .error .cover_50 .cover_95 .length_50 .length_95
#>   <chr>         <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
#> 1 .fitted        134.  0.000312     0.508     0.947    0.00802     0.0233
#> 2 .expected      134. -3.91         0.504     0.935  106.        334.
The output from report_sim() is a list of two data frames. The first data frame contains results for the parameters associated with the components() function: main effects and interactions, the associated hyper-parameters, and dispersion. The second data frame contains results for the parameters associated with the augment() function: the lowest-level rate parameters.
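The two data frames can be manipulated in the usual way. As a small illustration (using the dplyr functions loaded above), we could restrict the components results to the time effect:

## extract the 'components' results and keep only the time effect
res$components |>
  filter(term == "time")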
As can be seen in the results, the errors do not average out at exactly zero, 50% credible intervals do not contain the true value exactly 50% of the time, and 95% credible intervals do not contain the true value exactly 95% of the time. However, increasing the number of simulations from the default value of 100 to, say, 1000 will bring the average size of the errors closer to zero, and the actual coverage rates closer to their advertised values. When larger values of n_sim are used, it can be helpful to use parallel processing to speed up the calculations, which is done through the n_core argument.
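A larger study might look something like the following sketch (not run here; the values 1000 and 4 are purely illustrative):

## repeat the experiment 1000 times, spread across 4 cores
res_big <- report_sim(mod_est = mod,
                      n_sim = 1000,
                      n_core = 4)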
One slightly odd feature of the results is that the mean for .expected is very large. This reflects the fact that the data generation model draws some extreme values. We are developing a set of more informative priors that should avoid this behavior in future versions of bage.
In actual applications, no estimation model ever perfectly describes the true data generating process. It can therefore be helpful to see how robust a given model is to misspecification, that is, to cases where the estimation model differs from the data generation model. With report_sim(), this can be done by using one model for the mod_est argument, and a different model for the mod_sim argument.
Consider, for instance, a case where the time effect is generated from an AR1() prior, while the estimation model continues to use the default RW() prior:
mod_ar1 <- mod |>
  set_prior(time ~ AR1())
mod_ar1
#>
#> ------ Unfitted Poisson model ------
#>
#> divorces ~ age * sex + time
#>
#> exposure = population
#>
#>        term  prior along n_par n_par_free
#> (Intercept) NFix()     -     1          1
#>         age   RW()   age     4          4
#>         sex NFix()     -     2          2
#>        time  AR1()  time     6          6
#>     age:sex   RW()   age     8          8
#>
#> disp: mean = 1
#>
#> n_draw var_time var_age var_sexgender
#>   1000     time     age           sex
We set the mod_sim argument to mod_ar1 and generate the report.
res_ar1 <- report_sim(mod_est = mod, mod_sim = mod_ar1)
res_ar1
#> $components
#> # A tibble: 8 × 7
#>   term        component  .error .cover_50 .cover_95 .length_50 .length_95
#>   <chr>       <chr>       <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
#> 1 (Intercept) effect    -0.101      0.52      0.94       1.18       3.38
#> 2 age         effect    -0.173      0.493     0.942      1.44       4.17
#> 3 age         hyper      0.108      0.35      0.81       0.583      2.05
#> 4 sex         effect     0.107      0.5       0.955      1.13       3.28
#> 5 time        effect     0.0261     0.61      0.943      1.25       3.61
#> 6 age:sex     effect     0.0277     0.464     0.934      1.41       4.08
#> 7 age:sex     hyper     -0.0317     0.47      0.91       0.471      1.63
#> 8 disp        disp      -0.0536     0.48      0.96       0.240      0.708
#>
#> $augment
#> # A tibble: 2 × 7
#>   .var      .observed     .error .cover_50 .cover_95 .length_50 .length_95
#>   <chr>         <dbl>      <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
#> 1 .fitted        64.0 -0.0000399     0.506     0.95     0.00771     0.0223
#> 2 .expected      64.0  3.33          0.518     0.943   36.6       122.
In this case, although actual coverage for the hyper-parameters (in the components part of the results) now diverges from the advertised coverage, coverage for the low-level rates (in the augment part of the results) is still close to the advertised coverage.
report_sim() and replicate_data()
Functions report_sim() and replicate_data() overlap, in that both use simulated data to provide insights into model performance. Their aims are, however, different. Typically, report_sim() is used before fitting a model, to assess its performance across a random selection of possible datasets, while replicate_data() is used after fitting a model, to assess its performance on the dataset to hand.
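As a rough sketch of the two workflows (assuming replicate_data() is called on the fitted model with its default settings):

## before fitting: assess performance across many
## randomly-generated datasets
res_sim <- report_sim(mod_est = mod)

## after fitting: generate replicate datasets from the
## model fitted to the observed data
mod_fitted <- mod |>
  fit()
rep_data <- mod_fitted |>
  replicate_data()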