Overview of the Stan library in R
Hello!
Stan is a C++ library for Bayesian modeling and inference. It uses the NUTS (No-U-Turn Sampler) algorithm to draw posterior samples based on a user-specified model and data. Stan can also use the L-BFGS optimization algorithm to maximize an objective function, such as a log-likelihood.
To make it easier to work with Stan from the R programming language, there is the rstan package, which provides an R interface to Stan.
Today we are going to take a look at this package.
Installation
Open RStudio and run the command in the console to install the package:
install.packages("rstan", dependencies = TRUE)
Stan uses C++ for high-performance computing, so make sure your system is configured to compile C++ code. This may require installing Rtools for Windows or Xcode for Mac. After installation, execute the command:
pkgbuild::has_build_tools(debug = TRUE)
Just in case, let's check that everything is working properly; it's done like this:
example(stan_model, package = "rstan", run.dontrun = TRUE)
If there are no errors, you're all set.
The rstan package also depends heavily on several other R packages:
- StanHeaders (Stan C++ headers)
- BH (Boost C++ headers)
- RcppEigen (Eigen C++ headers for linear algebra)
- Rcpp (makes it easier to use C++ from R)
- inline (compiles C++ for use with R)
These dependencies should be installed automatically if you install the package rstan using one of the usual mechanisms.
A little about Stan itself
Stan is a probabilistic programming language optimized for statistical modeling tasks.
Stan uses a declarative style: the user defines the model in terms of statistical distributions without having to specify the algorithms used to compute inferences. The advantage is that the resulting model specification is clearer and more convenient.
Automatic differentiation allows Stan to compute gradients of the log density automatically, which is essential for the optimization algorithms and the Markov chain Monte Carlo (MCMC) methods used for inference.
Stan provides access to a number of algorithms for Bayesian inference, including:
- NUTS, an extension of the Hamiltonian Monte Carlo (HMC) algorithm.
- Variational methods, which offer a faster but approximate alternative to MCMC.
Stan supports modularity, allowing users to define their own functions; this facilitates code reuse and the creation of complex models.
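For illustration, here is a minimal sketch of a user-defined Stan function (the function name and body are invented for this example):
functions {
  // hypothetical helper: standardize a vector to zero mean and unit variance
  vector standardize(vector x) {
    return (x - mean(x)) / sd(x);
  }
}
data {
  int<lower=0> N;
  vector[N] y;
}
transformed data {
  vector[N] y_std = standardize(y); // reuse the helper on the input data
}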
Stan has integrations with several programming languages: R, Python, Julia, and MATLAB.
There is also a very good and active community on their forum.
Structure of a Stan program
A Stan program consists of several blocks, each of which plays a specific role in the modeling process:
- Data block (data): all inputs used in the model are declared here. These can be scalars, vectors, matrices, etc. The data in this block are assumed to be fixed and unchanged throughout the simulation.
- Parameters block (parameters): the model parameters to be estimated from the data are declared here, for example regression coefficients, variances, or scale parameters. Stan estimates them automatically with Markov chain Monte Carlo methods.
- Model block (model): the core of a Stan program. The probabilistic model of the data and parameters is defined here. You can use Stan's built-in distributions and functions to describe the prior distributions of the parameters and the likelihood of the observed data given those parameters.
- Transformation blocks (transformed data, transformed parameters): used to transform the data and the parameters, respectively. Transformed data can be useful for model simplification or data preprocessing, while transformed parameters are often used to derive secondary quantities computed from the estimated parameters.
- Generated quantities block (generated quantities): here you can compute values that depend on the parameters after estimation. It is often used to generate predictions and compute posterior distribution statistics.
Let's go straight to an example and create a linear regression model:
// define a linear regression model
data {
  int<lower=0> N;       // number of observations
  vector[N] y;          // dependent variable
  int<lower=0> K;       // number of predictors
  matrix[N, K] X;       // predictor matrix
}
parameters {
  vector[K] beta;       // regression coefficients
  real alpha;           // intercept
  real<lower=0> sigma;  // error standard deviation
}
model {
  // priors
  beta ~ normal(0, 10);  // normal prior for the coefficients
  alpha ~ normal(0, 10); // normal prior for the intercept
  sigma ~ cauchy(0, 5);  // Cauchy prior for the error standard deviation
  // likelihood
  y ~ normal(X * beta + alpha, sigma); // normal distribution of the dependent variable
}
generated quantities {
  vector[N] y_pred; // vector of predicted values
  for (i in 1:N) {
    y_pred[i] = normal_rng(X[i] * beta + alpha, sigma); // generate predictions
  }
}
So:
- In the data block (data) we declare the data the model needs: the number of observations N, the dependent variable y, the number of predictors K, and the predictor matrix X.
- In the parameters block (parameters) we declare the parameters to be estimated: the coefficient vector beta, the intercept alpha, and the error standard deviation sigma.
- In the model block (model) we define the priors for the parameters and the probabilistic model of the data. We use normal distributions for the coefficients and the intercept, and a Cauchy distribution for the error standard deviation. Then we state that the observed data y follow a normal distribution with mean equal to a linear combination of the predictors and standard deviation sigma.
- In the generated quantities block (generated quantities) we generate predicted values y_pred for each observation, using the normal_rng function, which draws random values from a normal distribution with the given parameters.
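To actually fit this model from R, a minimal sketch might look like the following (the file name linear_regression.stan and the simulated data are assumptions for illustration):
library(rstan)
# simulate toy data matching the data block
N <- 100; K <- 3
X <- matrix(rnorm(N * K), N, K)
y <- as.vector(X %*% c(1, -0.5, 2) + 3 + rnorm(N))
# assemble the data list to match the declared data block
stan_data <- list(N = N, y = y, K = K, X = X)
# compile and sample
fit <- stan(file = "linear_regression.stan", data = stan_data,
            iter = 2000, chains = 4)
print(fit, pars = c("alpha", "beta", "sigma"))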
You can also combine Stan with data preparation done in R.
Suppose we have a dataset and want to use functions from an R package to preprocess it before using it in a Stan model. We do the preprocessing in R and then pass the processed data to Stan:
library(rstan)
library(dplyr)
# our data frame
df <- data.frame(
  x = rnorm(100),
  y = rnorm(100, 2, 4)
)
# dplyr preprocessing; as.numeric() drops the matrix attributes that scale() returns
processed_data <- df %>%
  mutate(x_scaled = as.numeric(scale(x)), y_scaled = as.numeric(scale(y)))
# prepare the data for Stan
stan_data <- list(
  N = nrow(processed_data),
  x = processed_data$x_scaled,
  y = processed_data$y_scaled
)
# Stan model code
stan_code <- "
data {
int<lower=0> N;
vector[N] x;
vector[N] y;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma;
}
model {
y ~ normal(alpha + beta * x, sigma);
}
"
# run the model
fit <- stan(model_code = stan_code, data = stan_data)
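After sampling, the draws can be summarized and extracted (a sketch; the parameter names follow the model above):
posterior <- extract(fit) # list of posterior draws per parameter
mean(posterior$beta)      # posterior mean of the slope
quantile(posterior$sigma, c(0.025, 0.975)) # 95% interval for sigma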
Using C++ functions
You can write functions in C++, compile them into dynamic link libraries (DLLs or .so files depending on your OS), and then call those functions directly from the Stan model.
Suppose we have the following C++ function that we want to use in a Stan model. It is important that the function is "pure": it has no side effects, and its output is entirely determined by its input:
#include <cmath>
extern "C" {
double my_custom_function(double x) {
return std::exp(x); // simple example: return the exponential of the input value
}
}
Next, we use a C++ compiler, for example g++, to compile the function into a DLL or .so file:
g++ -shared -fPIC -o my_custom_function.so my_custom_function.cpp
To use an external function in a Stan model, you must declare it in the functions block:
functions {
// declare the external function
real my_custom_function(real x);
}
model {
// use the function in the model
real y = my_custom_function(0.5);
}
When compiling and running the Stan model from R, you need to make sure the dynamic library is available to the R session. Use R's dyn.load function to load the .so or DLL file before compiling the Stan model:
# load the dynamic library into R
dyn.load("path/to/my_custom_function.so")
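Note that dyn.load only makes the symbols visible to the R session. For Stan itself to link against external C++, rstan's documented route is to leave the function bodiless in the functions block, compile with allow_undefined = TRUE, and inject a C++ header that defines it via the includes argument. A minimal sketch (the header file name and the stan_code variable are assumptions):
library(rstan)
model <- stan_model(
  model_code = stan_code, # Stan code containing the bodiless declaration
  allow_undefined = TRUE,
  includes = paste0('\n#include "',
                    file.path(getwd(), "my_custom_function.hpp"), '"\n')
)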
Time series models
A first-order autoregressive model (AR(1)) assumes that the current value of the time series depends on its previous value plus random noise:
data {
  int<lower=0> N;      // number of observations
  vector[N] y;         // time series
}
parameters {
  real alpha;          // intercept
  real beta;           // autoregression coefficient
  real<lower=0> sigma; // noise standard deviation
}
model {
  for (n in 2:N)
    y[n] ~ normal(alpha + beta * y[n-1], sigma);
}
Let's run it:
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
# simulate AR(1) data (pure white noise would give beta near 0)
N <- 100
y <- as.numeric(arima.sim(list(ar = 0.7), n = N))
# data for Stan
stan_data <- list(N = N, y = y)
# compile and run the model
fit_ar1 <- stan(file = "ar1_model.stan", data = stan_data, iter = 2000, chains = 4)
print(fit_ar1)
A model with a local level and seasonality is suitable for analyzing time series with trend and seasonal fluctuations:
data {
  int<lower=0> N;             // number of observations
  int<lower=0> K;             // seasonality period
  vector[N] y;                // time series
}
parameters {
  vector[N] level;            // local level
  vector[K] season;           // seasonal effects
  real<lower=0> sigma_level;  // level noise
  real<lower=0> sigma_season; // seasonal noise
}
model {
  for (n in 2:N)
    level[n] ~ normal(level[n-1], sigma_level); // level model
  for (n in (K+1):N)
    y[n] ~ normal(level[n] + season[(n % K) + 1], sigma_season); // seasonal model
  // priors for the seasonal effects
  season ~ normal(0, sigma_season);
}
Let's run it:
# assume the data and the model file are already prepared
K <- 12 # seasonality period
stan_data <- list(N = N, K = K, y = y)
fit_seasonal <- stan(file="seasonal_model.stan", data = stan_data, iter = 2000, chains = 4)
print(fit_seasonal)
A model with a local trend lets you analyze time series whose trend changes over time:
data {
int<lower=0> N;
vector[N] y;
}
parameters {
vector[N] mu; // local trend
real<lower=0> sigma_mu; // trend noise (also used here as the observation noise)
}
model {
for (n in 2:N)
mu[n] ~ normal(mu[n-1], sigma_mu);
y ~ normal(mu, sigma_mu);
}
Let's run it:
# again, assume the data and the model file are already prepared
stan_data <- list(N = N, y = y)
fit_trend <- stan(file="local_trend_model.stan", data = stan_data, iter = 2000, chains = 4)
print(fit_trend)
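To see the estimated trend, one option is to overlay the posterior mean of mu on the observed series (a sketch; object names follow the code above):
posterior <- rstan::extract(fit_trend)
mu_hat <- colMeans(posterior$mu) # posterior mean of the local trend
plot(y, type = "l", col = "grey60", ylab = "y")
lines(mu_hat, col = "blue", lwd = 2) # smoothed trend over the data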
Processing missing data and partially known parameters
Modeling missing data as parameters: in this approach, missing values are treated as unknown parameters to be estimated during modeling.
Consider the case of a dataset with missing values in a variable y:
data {
  int<lower=0> N_obs;   // number of observed values
  int<lower=0> N_miss;  // number of missing values
  real y_obs[N_obs];    // observed data
}
parameters {
  real mu;              // mean of the distribution
  real<lower=0> sigma;  // standard deviation of the distribution
  real y_miss[N_miss];  // missing data modeled as parameters
}
model {
  y_obs ~ normal(mu, sigma);  // model for the observed data
  y_miss ~ normal(mu, sigma); // model for the missing data
}
y_miss is modeled in the same way as the observed data, with the same parameters mu and sigma.
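Preparing the data list for this model from a vector containing NAs could look like this (a sketch; the variable names are assumptions):
y_raw <- c(1.2, NA, 0.7, 2.1, NA, 1.5) # toy vector with missing values
stan_data <- list(
  N_obs = sum(!is.na(y_raw)),
  N_miss = sum(is.na(y_raw)),
  y_obs = y_raw[!is.na(y_raw)] # only the observed values are passed in
)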
Using predictive values to handle missing data: this approach fills in the missing entries with values predicted by the model:
data {
  int<lower=0> N;                  // total number of values (observed + missing)
  int<lower=0,upper=1> missing[N]; // missing-data indicator
  real y[N];                       // data; missing entries hold a placeholder (e.g. -999) and are never used
}
parameters {
  real mu;
  real<lower=0> sigma;
  real y_miss[N];
}
model {
  for (n in 1:N) {
    if (missing[n] == 1) {
      y_miss[n] ~ normal(mu, sigma); // model for the missing data
    } else {
      y[n] ~ normal(mu, sigma);      // model for the observed data
    }
  }
}
missing is an indicator array that marks whether the corresponding value in y is absent. If a value is missing, the model is applied to the corresponding element of y_miss instead.
Sometimes the model parameters may be partially known from previous studies or expert judgment. Stan allows you to integrate this information into the model:
data {
int<lower=0> N;
real y[N];
real<lower=0> sigma_prior; // partially known standard deviation
}
parameters {
real mu;
real<lower=0> sigma;
}
model {
sigma ~ normal(0, sigma_prior); // use the partially known parameter
y ~ normal(mu, sigma);
}
sigma_prior is used to set the prior distribution of the parameter sigma.
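Passing the partially known value in from R is then just another entry in the data list (a sketch, assuming y is the observed vector and the value 0.5 comes from a previous study):
stan_data <- list(
  N = length(y),
  y = y,
  sigma_prior = 0.5 # e.g. taken from earlier research
)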
Survival models
The exponential model assumes that the event intensity (hazard) is constant over time. This is the simplest form of survival model:
data {
  int<lower=0> N;                // number of observations
  vector[N] T;                   // time to event or censoring
  int<lower=0,upper=1> delta[N]; // event indicator (1 if the event occurred, 0 if the observation is censored)
}
parameters {
  real<lower=0> lambda;          // event intensity parameter
}
model {
  // prior for lambda
  lambda ~ gamma(0.01, 0.01);
  // likelihood
  for (n in 1:N) {
    if (delta[n] == 1) {
      // observations with an event
      target += log(lambda) - lambda * T[n];
    } else {
      // censored observations
      target += -lambda * T[n];
    }
  }
}
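A quick way to try the model is with simulated exponential survival times and independent uniform censoring (a sketch; the file name exponential_survival.stan is an assumption):
set.seed(1)
N <- 200
event_time <- rexp(N, rate = 0.5) # true event times
censor_time <- runif(N, 0, 4)     # censoring times
T <- pmin(event_time, censor_time)             # observed time
delta <- as.integer(event_time <= censor_time) # 1 = event, 0 = censored
stan_data <- list(N = N, T = T, delta = delta)
fit_exp <- stan(file = "exponential_survival.stan", data = stan_data,
                iter = 2000, chains = 4)
print(fit_exp, pars = "lambda")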
The Weibull model extends the exponential model by allowing the hazard to vary over time. It can be implemented by adding a shape parameter:
data {
  int<lower=0> N;
  vector[N] T;
  int<lower=0,upper=1> delta[N];
}
parameters {
  real<lower=0> lambda; // scale (rate) parameter
  real<lower=0> k;      // shape parameter
}
model {
  lambda ~ gamma(0.01, 0.01);
  k ~ gamma(0.01, 0.01);
  for (n in 1:N) {
    if (delta[n] == 1) {
      // log density = log hazard + log survival; note the k * log(lambda) term
      target += log(k) + k * log(lambda) + (k - 1) * log(T[n]) - pow(T[n] * lambda, k);
    } else {
      // censored observations contribute only the log survival
      target += -pow(T[n] * lambda, k);
    }
  }
}
The Cox proportional hazards model lets you analyze the impact of explanatory variables on risk without specifying the baseline hazard function.
Implementing the Cox model directly in Stan can be difficult (it arguably deserves a separate article) because it is based on a partial likelihood.
To approximate the Cox model in Stan, you can consider Bayesian regression models with priors on the regression coefficients and an approximation of the baseline hazard, for example using exponential or Weibull distributions.
Clustering models
There are many approaches to implementing clustering models in Stan.
A K-means-style model can be implemented by optimizing the distance between data points and cluster centers:
data {
  int<lower=0> N;      // number of data points
  int<lower=1> K;      // assumed number of clusters
  vector[N] x;         // data
}
parameters {
  vector[K] mu;        // assumed cluster centers
  real<lower=0> sigma; // standard deviation
}
model {
  sigma ~ normal(0, 1); // prior for sigma
  for (n in 1:N) {
    vector[K] log_prob;
    for (k in 1:K) {
      log_prob[k] = normal_lpdf(x[n] | mu[k], sigma);
    }
    target += log_sum_exp(log_prob); // log of the sum of exponentiated log densities
  }
}
The model seeks values of mu and sigma that maximize the likelihood of the observed data, which is analogous to minimizing the distance between data points and cluster centers in K-means.
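To try it on toy data, one can simulate two well-separated clusters (a sketch; the file name kmeans_like.stan is an assumption; note that mixture posteriors are multimodal, so label switching between chains is possible):
set.seed(42)
x <- c(rnorm(50, -2, 0.5), rnorm(50, 2, 0.5)) # two clusters
stan_data <- list(N = length(x), K = 2, x = x)
fit_km <- stan(file = "kmeans_like.stan", data = stan_data,
               iter = 2000, chains = 4)
print(fit_km, pars = c("mu", "sigma"))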
A Gaussian mixture model clusters the data under the assumption that each cluster is generated from a Gaussian distribution:
data {
  int<lower=1> K;         // number of clusters
  int<lower=1> N;         // number of observations
  vector[2] y[N];         // data
}
parameters {
  simplex[K] theta;       // mixture weights
  vector[2] mu[K];        // cluster means
  cov_matrix[2] Sigma[K]; // covariance matrices
}
model {
  vector[K] log_theta = log(theta);
  for (n in 1:N) {
    vector[K] log_phi_n;
    for (k in 1:K) {
      log_phi_n[k] = multi_normal_lpdf(y[n] | mu[k], Sigma[k]);
    }
    target += log_sum_exp(log_phi_n + log_theta);
  }
}
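Passing 2-D data into this model from R is straightforward, since rstan maps an N x 2 matrix onto vector[2] y[N] (a sketch; the file name gmm.stan is an assumption):
set.seed(7)
# two 2-D clusters stacked row-wise
y <- rbind(matrix(rnorm(100, -2, 0.7), ncol = 2),
           matrix(rnorm(100, 2, 0.7), ncol = 2))
stan_data <- list(K = 2, N = nrow(y), y = y)
fit_gmm <- stan(file = "gmm.stan", data = stan_data, iter = 2000, chains = 4)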
LDA (latent Dirichlet allocation) is a popular model for topic-based clustering of text data:
data {
  int<lower=1> W;       // vocabulary size
  int<lower=1> D;       // number of documents
  int<lower=1> N;       // total number of words
  int<lower=1> word[N]; // word indices
  int<lower=1> doc[N];  // document indices
  int<lower=1> K;       // number of topics
}
parameters {
  simplex[K] theta[D];  // topic distribution per document
  simplex[W] phi[K];    // word distribution per topic
}
model {
  for (n in 1:N) {
    vector[K] gamma;
    for (k in 1:K) {
      gamma[k] = log(theta[doc[n], k]) + log(phi[k, word[n]]);
    }
    target += log_sum_exp(gamma);
  }
}
A little about visualization
Once the model is compiled and sampling is done, you can use base R functions such as plot and hist to visualize the results. For example, a simple histogram of a model parameter:
library(rstan)
# assume `fit` is the object returned by stan()
samples <- extract(fit)$parameter_name # extract the samples of a parameter
hist(samples, breaks = 40, main = "Parameter distribution", xlab = "Parameter value")
Using bayesplot, MCMC output can be visualized, including trace plots, density plots, HPD intervals, and diagnostic plots.
Trace plots show how the parameter values changed during sampling:
library(bayesplot)
color_scheme_set("brightblue")
mcmc_trace(as.array(fit), pars = c("alpha", "beta"))
Density plots show the posterior distributions of the parameters:
mcmc_areas(as.array(fit), pars = c("alpha", "beta"))
Rhat and effective sample size (ESS) diagnostics help assess sampling quality; bayesplot provides helper extractors for these:
mcmc_rhat(rhat(fit))
mcmc_neff(neff_ratio(fit))
There is also shinystan, which provides an interactive interface for analyzing and visualizing sampling results:
# install and launch shinystan
if (!requireNamespace("shinystan", quietly = TRUE)) {
install.packages("shinystan")
}
library(shinystan)
launch_shinystan(fit)
Stan provides an intuitive modeling language and supports a wide range of statistical models and parameter estimation algorithms.
You can read more in the official documentation.