Life after pandemic

Hello fellow internet-nauts!

Probably only the internet bots that relentlessly spam this space noticed, but I’ve been away during these two years of pandemic. The reasons are diverse, but they have to do with taking care of my family (especially my dog) and myself, managing some issues related to anxiety and excessive workloads and, why not, enjoying my city from time to time.

I managed to finish my PhD and publish the relevant chapters; you can find the links to them in the “publications” section. I assume, therefore, that I’ll be more relaxed and will try to add new posts on ecological modelling, spatial statistics, computer stuff and, why not, the emancipation of our digital lives. Please do comment on this space with ideas or anything else. Of course, those bots that post p0rn, scams or any other ill-intentioned links will be banned and sent to their respective chingada madre!

Thank you for reading, and I’ll keep you posted,

Juan

STAN – Reuse your warmed-up sample!

MCMC is a great framework for sampling from the posterior distributions of complex statistical models. In order to sample from these distributions, the Markov chains need to reach stationarity. This warm-up stage is often computationally intensive and (as far as I know) there is no way to speed it up by adding independent chains (which can be parallelised).

Once we have reached stationarity, however, we can reuse the warmed-up state to sample from independent chains simultaneously, using threads, multicore processors or even independent computers. The way to do this is shown in the following example:

import numpy as np

from pystan import StanModel
from pystan.constants import MAX_UINT

# bernoulli model
model_code = """
    data {
      int<lower=0> N;
      int<lower=0,upper=1> y[N];
    }
    parameters {
      real<lower=0,upper=1> theta;
    }
    model {
      theta ~ beta(0.5, 0.5);  // Jeffreys' prior
      for (n in 1:N)
        y[n] ~ bernoulli(theta);
    }
    """
data = dict(N=10, y=[0, 1, 0, 1, 0, 1, 0, 1, 1, 1])
sm = StanModel(model_code=model_code)
# initial seed can also be chosen by user
# MAX_UINT = 2147483647
seed = np.random.randint(0, MAX_UINT, size=1)
fit = sm.sampling(data=data, seed=seed)


# reuse tuned parameters
stepsize = fit.get_stepsize()
# by default .get_inv_metric returns a list
inv_metric = fit.get_inv_metric(as_dict=True)
init = fit.get_last_position()
# increment seed by 1
seed2 = seed + 1
control = {"stepsize": stepsize,
           "inv_metric": inv_metric,
           "adapt_engaged": False}

fit2 = sm.sampling(data=data,
                   warmup=0,
                   iter=1000,
                   control=control,
                   init=init,
                   seed=seed2)

This example was taken from the PyStan documentation.
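As a minimal sketch of the idea of distributing the continuation runs (it reuses sm, data, stepsize, inv_metric, init and seed from the block above; the loop and the seed offsets are just my own illustration), each call below is an independent continuation of the same warmed-up state, so they could just as well be launched in separate processes or on separate machines and pooled afterwards:

# Sketch: several independent continuations of the same warmed-up state.
# Each run reuses the tuned stepsize/inverse metric and the last position,
# skips warm-up entirely, and differs only in its random seed.
fits = []
for k in range(1, 4):  # three independent continuation runs
    fits.append(sm.sampling(data=data,
                            warmup=0,                 # no re-adaptation
                            iter=1000,
                            control={"stepsize": stepsize,
                                     "inv_metric": inv_metric,
                                     "adapt_engaged": False},
                            init=init,                # last warmed-up position
                            seed=int(seed) + k))      # distinct seed per run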

Hierarchical logistic regression

This is a rough translation of my original post in Spanish. Just to reach a bilingual audience, I actually used Google Translate, hehehe!

I am surprised by the power and flexibility STAN offers for expressing complex statistical models. As you may know, STAN is a declarative probabilistic programming language. Statistical models are described with a simple syntax, in a way very similar to what we do on the board or on paper. STAN is built on Bayesian principles and uses several methods to sample from posterior distributions, among the best known being Hamiltonian Monte Carlo, the No-U-Turn sampler (NUTS) and Metropolis–Hastings (I’m not going to go into detail on this now; there are lots of references about it on the internet).

In this post we will see how to define a hierarchical logistic model. Suppose that each binary observation \(y_i \in \{0,1\}\) has an associated level \(j_i \in \{1,2,\ldots,J\}\) and a vector of covariates (predictors) \(x_i \in \mathbb{R}^K\). We will assume that the observations are realizations of independent Bernoulli processes, conditional on a normal hierarchical process. That is:

\([Y \mid S] \sim Bernoulli(\mathrm{logit}^{-1}(S))\)

Now, \(S\) can be described as follows:

\(S_{i,j} = \beta^t_j x_i + \epsilon\)

It reads as follows: the variable \(S\) for observation \(i\) at level \(j\) equals \(\beta^t_j x_i + \epsilon\); that is, the vector \(\beta_j\) holds the coefficients associated with the response \(S_i\) at level \(j\). \(S\) enters the likelihood through the logit link, which is why the Stan code below uses bernoulli_logit.
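Putting the pieces together (this simply restates, in equations, what the Stan program below declares), the full hierarchy is:

\(y_i \mid \beta, j_i \sim Bernoulli\big(\mathrm{logit}^{-1}(\beta^t_{j_i} x_i)\big)\)

\(\beta_{j,k} \sim \mathcal{N}(\mu_k, \sigma), \quad j = 1,\ldots,J,\; k = 1,\ldots,K\)

\(\mu_k \sim \mathcal{N}(0, 100)\)

\(\sigma \sim Uniform(0, 100)\)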

In STAN this is equivalent to:

hierarquical_log_model = u"""
data {
  int<lower=0> N; // num obs. 
  int<lower=0> J; // number of levels
  int<lower=1> K;                // num covariates
  matrix[N, K] x;                // design matrix
  int<lower=1,upper=J> level[N];  // type of level (spec)
  int<lower=0,upper=1> y[N];              // observations, in this case is binary.
  
} 
parameters {
  real mu[K];
  vector[K] beta[J];
  real<lower=0,upper=100> sigma; // same sigma

}

model {
    for (k in 1:K) {
        mu[k] ~ normal(0,100);
        for (j in 1:J)
            beta[j,k] ~ normal(mu[k], sigma);
    }
    for (i in 1:N)
        y[i] ~ bernoulli_logit(x[i] * beta[level[i]]);
  
}"""

To run the model in Python using PyStan:

import pystan

varying_intercept_model = pystan.StanModel(model_code=hierarquical_log_model)
# X is the design matrix (a pandas DataFrame of covariates) and `data` holds
# the observations: `data.codes` are the level indices (note that Stan expects
# them to run from 1 to J) and `data.obs` the binary responses.
multilevel_linear_data = {'N': X.shape[0],
                          'J': 2,
                          'K': X.shape[1],
                          'level': data.codes.values,  # levels
                          'x': X.values,
                          'y': data.obs.values}
%time fit = varying_intercept_model.sampling(data=multilevel_linear_data, iter=1000, chains=1)
sample = fit.extract(permuted=True)
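As a quick sketch of what to do with the draws (the array shape follows from the parameter declarations above; the summary itself is just my own illustration), the extracted 'beta' array has one coefficient vector per level:

import numpy as np

beta_draws = sample['beta']             # shape: (n_draws, J, K)
beta_mean = beta_draws.mean(axis=0)     # posterior mean coefficients per level
beta_ci = np.percentile(beta_draws, [2.5, 97.5], axis=0)  # 95% credible intervals
print("posterior mean coefficients per level:\n", beta_mean)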

Resources:

Originally taken from: https://mc-stan.org/docs/2_18/stan-users-guide/basic-motivation.html

Very good tutorial: https://mc-stan.org/users/documentation/case-studies/radon.html