Life after pandemic

Hello fellow internet-nauts!

Probably only the internet bots that intensively annoy this space noticed, but I’ve been away during these two years of pandemic. The reasons are diverse, but they have to do with taking care of my family (especially my dog) and myself, managing some issues related to anxiety and excessive workloads and, why not, enjoying my city from time to time.

I managed to finish my PhD and publish the relevant chapters. You can find the links to them in the “publications” section. I expect, therefore, to be more relaxed and will try to add new posts on ecological modelling, spatial statistics, computer stuff and, why not, the emancipation of our digital lives. Please do comment on this space with ideas or anything else. Of course, the bots that post p0rn, scams or other ill-intentioned links will be banned and sent to their respective chingada madre!

Thank you for reading, and I’ll keep you posted,

Juan

STAN – Reuse your warmed-up sample!

MCMC is a great framework for sampling the posterior distributions of complex statistical models. To sample from these distributions, the Markov chains need to reach stationarity. Reaching this condition is often computationally intensive and (as far as I know) there is no way to speed this stage up by adding independent chains (which can be parallelised).

Once stationarity is reached, however, we can sample from independent chains simultaneously, using threads, multicore processors or even independent computers. The way to do this is shown in the following example:

import numpy as np
from pystan import StanModel
from pystan.constants import MAX_UINT

# bernoulli model
model_code = """
    data {
      int<lower=0> N;
      int<lower=0,upper=1> y[N];
    }
    parameters {
      real<lower=0,upper=1> theta;
    }
    model {
      theta ~ beta(0.5, 0.5);  // Jeffreys' prior
      for (n in 1:N)
        y[n] ~ bernoulli(theta);
    }
    """
data = dict(N=10, y=[0, 1, 0, 1, 0, 1, 0, 1, 1, 1])
sm = StanModel(model_code=model_code)
# the initial seed can also be chosen by the user
# MAX_UINT = 2147483647
seed = np.random.randint(0, MAX_UINT, size=1)
fit = sm.sampling(data=data, seed=seed)


# reuse tuned parameters
stepsize = fit.get_stepsize()
# by default .get_inv_metric returns a list
inv_metric = fit.get_inv_metric(as_dict=True)
init = fit.get_last_position()
# increment seed by 1
seed2 = seed + 1
control = {"stepsize": stepsize,
           "inv_metric": inv_metric,
           "adapt_engaged": False}

fit2 = sm.sampling(data=data,
                   warmup=0,
                   iter=1000,
                   control=control,
                   init=init,
                   seed=seed2)
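Because adaptation is disabled, the same trick can be repeated with different seeds to draw additional batches of chains in parallel (in separate processes or even machines) and pool the draws afterwards. A minimal sketch, reusing the objects defined above:

# Run three more samplers from the warmed-up state, each with its own seed,
# then pool the posterior draws of theta into a single array.
fits = [sm.sampling(data=data, warmup=0, iter=1000,
                    control=control, init=init, seed=int(seed) + k)
        for k in range(2, 5)]
theta_draws = np.concatenate([f.extract(permuted=True)["theta"] for f in fits])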

This example was taken from the PyStan documentation.

Simulating a spatial Conditional Autoregressive Model (CAR) from a graph G(V,E)

This is the second delivery on how to simulate a Conditional Autoregressive Model (CAR) using a given graph (aka lattice, network, etc.). For simulating a CAR model without a given graph see: [ HERE ].

Ingredients

  • An adjacency matrix (M) representation of a lattice. This represents a simple graph, i.e. \(\delta_{i,j} = 1\) if there exists an edge between \(i\) and \(j\).
  • [To plot] A corresponding GeoPandas data frame to visualize the polygons of the lattice.

Methods

Load the necessary data. In this example, the lattice L corresponds to the following figure.

import numpy as np
import geopandas as gpd
import matplotlib.pyplot as plt

## Geometry file
geometryfile = "training_data_sample_puebla_p9_geometry.json"
## Read file
gg = gpd.read_file(geometryfile)
gg.set_index('indx', inplace=True)
plt.rcParams['figure.figsize'] = [10, 10]
gg.plot(edgecolor='black')
## Read the corresponding adjacency matrix
mat_filename = "training_data_sample_puebla_p9_abies_pinophyta_adjmat.npy"
M = np.load(file=mat_filename)

The order of the columns (rows) of M corresponds to the ascending order of the indices of the polygons in the GeoPandas data frame, i.e. the element in gg with the lowest index corresponds to the first column (row) of M, and so on. If this order is not respected, the spatial autocorrelation will be lost.
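A quick sanity check of this alignment can save some debugging later. A minimal sketch, assuming gg is indexed by the same cell ids used to build M:

## The number of polygons must match the dimensions of the adjacency matrix.
sorted_ids = gg.sort_index().index.values
assert M.shape[0] == M.shape[1] == len(sorted_ids), "M and lattice sizes differ"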

Building the precision matrix Q

The probability distribution of a CAR model is a multivariate normal distribution \(MVN(0,Q^{-1})\), so simulating a CAR model corresponds to simulating this process. The challenge resides, however, in the construction of the precision matrix \(Q\). We will follow the work of Besag, 1991, Rue and Held, 2005, and Morris et al., 2019. For any doubt, go to those references.

While preparing this tutorial I had problems simulating an intrinsic CAR model: its precision matrix is only positive semi-definite (it is singular), so it cannot be factorised directly. We will therefore build a proper Q as follows:

\(Q = \tau^2 D_{w}(I - \alpha B)\)

where \(\tau^2\) models the general variance, \(D_{w}\) is the diagonal matrix holding the number of neighbours of each node in the graph, and B is the row-normalised adjacency matrix \(B = D_{w}^{-1}M\), where M is the corresponding adjacency matrix. This construction guarantees that Q will be positive definite. The parameter \(\alpha\) represents the degree of spatial autocorrelation and ranges in \((0, 1)\). I recommend testing different values of \(\alpha\). In addition, remove any “island” nodes (nodes without neighbours); failure to do so can lead to a loss of positive definiteness. The code to achieve this is here:

def buildQ(M, D, tau_2=2.0, alpha=0.99999):
    """Build the proper CAR precision matrix Q = tau^2 * D_w (I - alpha * B),
    with B = D_w^{-1} M, i.e. Q = tau^2 (D_w - alpha * M)."""
    I = np.identity(M.shape[0])
    D_w = np.diag(D)                 # D is the 1-D vector of neighbour counts
    B = np.dot(np.diag(1.0 / D), M)  # row-normalised adjacency matrix
    Q = tau_2 * np.dot(D_w, I - alpha * B)
    return Q

## Calculate number of neighbours
D = np.sum(M, axis=1)
## Remove islands
idx_islands = np.where(D == 0)
ND = np.delete(D, idx_islands, axis=0)
NM = np.delete(M, idx_islands, axis=0)
NM = np.delete(NM, idx_islands, axis=1)

## Calculate Q
Q = buildQ(NM, ND, tau_2=2.0, alpha=0.995)
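Before simulating, it is worth checking positive definiteness explicitly; the Cholesky factorisation itself serves as the test. A minimal sketch:

## np.linalg.cholesky raises LinAlgError when Q is not positive definite.
try:
    np.linalg.cholesky(Q)
    print("Q is positive definite")
except np.linalg.LinAlgError:
    print("Q is NOT positive definite: revisit alpha or check for islands")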


Here we visualize the covariance matrix \(Q^{-1}\):
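A minimal sketch of this visualisation, inverting the precision matrix built above:

## Invert the precision matrix and show the covariance as a heat map.
cov = np.linalg.inv(Q)
plt.imshow(cov, interpolation='None')
plt.colorbar()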


As we have now made sure (by construction, or by trial and error 😛) that Q is positive definite, we can calculate its Cholesky decomposition, which makes the simulation remarkably fast and efficient. Here we follow algorithm 2.4 from Rue and Held. The code is as follows:

## Simulation, Algorithm 2.4 (Rue and Held, 2005)
def generateCARSimulation(Q, mu):
    """Sample v ~ MVN(mu, Q^{-1}) using the Cholesky factor of Q."""
    L = np.linalg.cholesky(Q)              # Q = L L^t
    z = np.random.normal(loc=0.0, scale=1.0, size=L.shape[0])
    v = np.linalg.solve(L.transpose(), z)  # solve L^t v = z
    return mu + v

mu = np.zeros(Q.shape[0])  # zero-mean CAR field
v = generateCARSimulation(Q, mu)

To visualise the simulation, just join the random vector v with the GeoPandas dataframe gg, as follows.

import pandas as pd

## order index (cell_id) ascending, according to M.
ordered_g = gg.sort_index()
## Reset the index; this is necessary in order to join.
ordered_g.reset_index(inplace=True)
## Create a dataframe with the simulated vector
yy = pd.DataFrame(v, columns=['simulated'])
## Join it with the geospatial dataframe
simulation = ordered_g.join(yy)
## Plot it!
simulation.plot(column='simulated')

This is the visualisation: a simulation of a CAR process with a given adjacency matrix and lattice.

Thanks for reading and remember to use your skills for the benefit of Mother Earth.

Logistic Hierarchical regression

This is a blunt translation of my original post, just to reach a bilingual audience. I actually used Google Translate, hehehe!

I am surprised by STAN’s ability and flexibility to express complex statistical models. As you know, STAN is a declarative probabilistic programming language. Statistical models are described with a simple syntax, in a way very similar to what we would do on the board or on paper. STAN is based on Bayesian principles and uses several methods to sample posterior distributions, among the best known being Hamiltonian Monte Carlo, the No-U-Turn sampler (NUTS) and Metropolis-Hastings (I’m not going to go into detail on this now; there are lots of references about it on the internet).

In this post we will see how to define a hierarchical logistic model. Suppose that for any binary observation \(y_i \in \{0,1\}\) there is an associated level \(j_i \in \{1,2,\ldots,J\}\) and a vector of covariates (predictors) \(x_i \in \mathbb{R}^K\). We will therefore assume that the observations are realizations of an independent Bernoulli process when conditioned on a normal hierarchical process. That is:

\([Y | S] \sim Bernoulli(S)\)

Now, \(S\) can be described as follows:

\(S_{i,j} = \beta^t_j x_i + \epsilon\)

It reads as follows: the variable S for observation i at level j is equal to \(\beta^t_j x_{i} + \epsilon\). That is, the vector \(\beta_j\) holds the coefficients associated with the response \(S_{i}\) at level j.

In STAN this is equivalent to:

hierarquical_log_model = u"""
data {
  int<lower=0> N;                 // num obs
  int<lower=0> J;                 // number of levels
  int<lower=1> K;                 // num covariates
  matrix[N, K] x;                 // design matrix
  int<lower=1,upper=J> level[N];  // type of level (spec)
  int<lower=0,upper=1> y[N];      // observations, binary in this case
}
parameters {
  real mu[K];
  vector[K] beta[J];
  real<lower=0,upper=100> sigma;  // same sigma for every level
}
model {
  for (k in 1:K) {
    mu[k] ~ normal(0, 100);
    for (j in 1:J)
      beta[j,k] ~ normal(mu[k], sigma);
  }
  for (i in 1:N)
    y[i] ~ bernoulli_logit(x[i] * beta[level[i]]);
}"""

To run the model in Python using PyStan:

import pystan

# X is a pandas DataFrame with the covariates; `data` holds the level codes
# and the binary observations.
varying_intercept_model = pystan.StanModel(model_code=hierarquical_log_model)
multilevel_linear_data = {'N': X.shape[0],
                          'J': 2,
                          'K': X.shape[1],
                          'level': data.codes.values,  # levels
                          'x': X.values,
                          'y': data.obs.values}
%time fit = varying_intercept_model.sampling(data=multilevel_linear_data, iter=1000, chains=1)
sample = fit.extract(permuted=True)
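To take a quick look at the posterior, the draws can be pulled out of the sample dictionary. A minimal sketch, using the beta parameter defined above:

import numpy as np

# 'beta' has shape (n_draws, J, K): one coefficient vector per level.
beta = sample['beta']
print(beta.mean(axis=0))                         # posterior means per level
print(np.percentile(beta, [2.5, 97.5], axis=0))  # 95% credible intervals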

Resources:

Originally taken from: https://mc-stan.org/docs/2_18/stan-users-guide/basic-motivation.html

Very good tutorial: https://mc-stan.org/users/documentation/case-studies/radon.html

How to load the Biospytial Knowledge Engine

Biospytial is composed of three interconnected components:

  • The Geoprocessing-Backend (GPB)
  • The Graph-Computing-Engine (GCE)
  • The Biospytial-Client (BPE)

All these components are implemented as Docker containers. A docker-compose file for Linux and OSX can be downloaded with the source code.

Assuming you have downloaded the code from the repository, you can load the full service using the following commands.

If it’s a Linux host:

docker stack deploy -c biospytial_stack.yml biospytial


If it’s an OSX host:

docker stack deploy -c biospytial_stackOSX.yml biospytial

Stopping the services

To stop the services run:

docker stack rm biospytial
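To check that everything came up correctly, you can list the running services with the standard Docker command:

docker stack services biospytial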


The “inside cell” ratio

Let \(C_s = \{c_1, c_2, \ldots, c_n\}\) be the set of cells at scale \(s\) where the occurrences of a node X were found. Then \(C_{s-1} = \{d_1, d_2, \ldots, d_k\}\) is the corresponding set of cells at an upper scale (ancestor of \(s\)) where the occurrences of X were found.

Note that the ratio:
\(r_s = \frac{\#C_{s-1}}{\#C_s}\)

gives us an indicator of how the occurrences are dispersed in space.

If \(r_s\) is low, it means that the spatial distribution is constrained to a region as small as the unit area of the upper scale, while if \(r_s\) is close to 1, the occurrences are as spatially dispersed as the cells of the upper scale.
The method can be applied recursively to the successive scales to obtain a list of ratios \(r_1, r_2, \ldots, r_s, \ldots\) that can be fitted in a model to estimate geographic extents.
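As an illustration, here is a minimal sketch of this computation, assuming a (hypothetical) list cells_by_scale holding one collection of occupied cell ids per scale, coarsest first:

def inside_cell_ratios(cells_by_scale):
    """Return the ratios r_s = #C_{s-1} / #C_s between consecutive scales."""
    counts = [len(set(cells)) for cells in cells_by_scale]
    return [coarse / float(fine) for coarse, fine in zip(counts, counts[1:])]

## Example: 3 occupied coarse cells over 12 occupied fine cells gives
## r = 0.25, i.e. a spatially constrained (clustered) pattern.
print(inside_cell_ratios([[1, 1, 2, 3], range(12)]))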

A model for presence-only data

This is the set-up of a conditional autoregressive (CAR) logistic model for predicting species presence using a sampling signal and presence-only data.

Set-up

Let \(Sp\) be a species and \(Y\) and \(X\) two random variables corresponding to the events: \(Sp\) is present at location \(x_i\), and: location \(x_i\) has been sampled. (The variable \(X\) and the locations \(x_i\) are not related.)

\(Y\) and \(X\) are considered to be independent binary (Bernoulli) variables conditional on the latent processes \(S\) and \(P\) respectively.

Latent processes

These processes are modeled as follows:

$$ logit(S_k) = \beta_s^t d_s(x_k) + \psi_k +O_k $$

$$ logit(P_k) = \beta_p^t d_p(x_k) + \psi_k + O_k $$

where \(O_k\) is an offset term, \(d_p(x_k)\) and \(d_s(x_k)\) are the covariates for P and S respectively, and \(\psi_k\) is modeled as a Gaussian Markov random field.

Common Gaussian Markov Random Field (CGMRF)

This process is modeled as:

$$ \psi_k = \phi_k + \theta_k $$

$$ \phi_k | \phi_{-k}, \mathbb{W},\tau^2 \sim N \left( \frac{\sum_{i=1}^{K} w_{ki} \phi_i}{\sum_{i=1}^{K}w_{ki}}, \frac{\tau^2}{\sum_{i=1}^{K}w_{ki}}\right) $$
$$\theta_k \sim N\left(0, \sigma^2\right)$$

$$\tau^2, \sigma^2 \sim^{iid} Inv.Gamma(a,b) $$

where \(\mathbb{W}\) is the adjacency matrix of the lattice and \(\theta_k\) is an independent noise term with constant variance. \(\sigma^2\) and \(\tau^2\) are independent and identically distributed hyperparameters sampled from an inverse gamma distribution.
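The full conditional above translates directly into a single-site Gibbs update. A minimal sketch (not the referenced implementation), assuming a symmetric weight matrix W with zero diagonal:

import numpy as np

def gibbs_sweep_phi(phi, W, tau2, rng=np.random):
    """One sweep of single-site Gibbs updates for the CAR field phi, drawing
    each phi_k from N(weighted neighbour mean, tau2 / sum_i w_ki)."""
    for k in range(len(phi)):
        w = W[k]
        wsum = w.sum()
        if wsum == 0:
            continue  # isolated node: the full conditional is undefined
        mean = w.dot(phi) / wsum
        phi[k] = rng.normal(mean, np.sqrt(tau2 / wsum))
    return phi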

The corresponding directed acyclic graph (DAG) can be seen in the next figure.

Implementation

A current implementation of this model can be found here. Of particular interest is the file joint.binomial.bymCAR.R, where you can find the joint sampler between lines 113 and 149.


Stationary GMRF simulation with Discrete Fourier Transformation

An algorithm to simulate a GMRF with a block-circulant precision matrix.

Algorithm 2.10

Taken from: Rue, H., & Held, L. (2005). Gaussian Markov random fields: theory and applications. CRC press.

Let’s build the circulant matrix for the torus.

OK, for the moment I’ll follow the example in the GMRF book, i.e. a stationary torus of 128×128.

import numpy as np

def createToroidalCircularBase(d=0.1, N=128):
    """
    Creates the circulant base of the torus precision matrix, similar to the
    one described in Rue and Held, 2005. The stencil is (4 + d) at the origin
    and -1 at the four nearest neighbours.
    """
    c00 = np.hstack(((4 + d), -1, np.zeros(N - 3), -1))
    c01 = np.hstack((-1, np.zeros(N - 1)))
    c0 = np.zeros((N - 2, N))
    c1 = np.vstack((c00, c01))
    c = np.vstack((c1, c0))
    c[N - 1, 0] = -1
    return c
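Since the eigenvalues of a block-circulant matrix are given by the 2-D DFT of its base, positive definiteness can be checked directly. A minimal sketch, using the function above:

from scipy.fftpack import fft2

## The imaginary parts vanish by symmetry; every eigenvalue
## 4 + d - 2cos(2*pi*i/N) - 2cos(2*pi*j/N) must be positive.
eigenvalues = fft2(createToroidalCircularBase(d=0.1, N=128)).real
print(eigenvalues.min() > 0)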







%%time
## Create circular base
d = 0.00001
N = 100
c = createToroidalCircularBase(d=d, N=N)
## Simulate random noise (normally distributed)
import scipy.stats
import matplotlib.pyplot as plt
from scipy.fftpack import ifft2, fft2
zr = scipy.stats.norm.rvs(size=(c.size, 2), loc=0, scale=1, random_state=1234)
## Reinterpret each pair of reals as one complex standard normal
zr.dtype = np.complex_

## Algorithm 2.10: the 2-D DFT of the base gives the eigenvalues of Q
Lm = fft2(c)
v = 1.0 / len(c) * fft2((Lm ** -0.5) * zr.reshape(Lm.shape))
x = v.real
plt.imshow(x, interpolation='None')
Both figures represent a simulation on a 100×100 grid of cells. The comparison is as follows: one panel shows a full Gaussian field with mean zero and a Matérn-based covariance structure; the other shows a Markov random field simulation using the method described above.

[Bonus] Create a simple precision matrix

An easy way to create a precision matrix is with the function *diags* from the module scipy.sparse. The code looks like this:

import numpy as np
from scipy.sparse import diags

N = 10
diagonals = [np.zeros(N), np.ones(N - 1), np.ones(N - 1)]
# offset 0 := main diagonal, 1 := first above it, -1 := first below it
offset = [0, 1, -1]
## maps diagonals to positions (offsets)
M = diags(diagonals, offset).toarray()
## "seal the edges"
M[0][0] = 1
M[N - 1][N - 1] = 1
print(M)

Result:

[[ 1.  1.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 1.  0.  1.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  1.  0.  1.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  1.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  1.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  1.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  1.  0.  1.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  1.  0.  1.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.  1.]]

Update

There is a new tutorial for simulating a CAR model using a given adjacency matrix and spatial lattice. Go here to read it.

How to add new spatial data source in Biospytial

Biospytial is a Knowledge Engine that merges different data sources in order to model aspects of biodiversity using a geostatistical framework. Biospytial is still in testing mode; this means that, at this point, many features for easing maintenance, provisioning and configuration are not yet developed. However, it is still a powerful tool to extract, analyse and model complex macro-ecological patterns.

Aims of this tutorial

In this tutorial you will find a rough guide on how to install new data sources. We will install a vector-based data source called ‘global_ecoregions’ and a raster-based data source, ‘World Population for Latin America’.

Assumptions

Here I will assume that you have a fully installed and running Biospytial suite, i.e. the containers:

  • Geoprocessing-Backend (GPB)
  • Graph-Computing-Engine (GCE)
  • Biospytial-Client (BPE)
  • The data sources downloaded and placed in a path accessible from the Biospytial-Client.

Converting the data to Django Models

To handle the data, Biospytial uses the Django ORM to access geospatial data stored in the Geoprocessing-Backend. To achieve this, we need to specify a Model class for each data source; in other words, each data source needs a class specification for communicating with the relational database manager.

Vector data

We will make use of the tool ‘ogrinspect’ to generate the model definition for a shapefile, following these steps.

  1. Log in to a Biospytial-Client session (the bash shell, not the ipython environment).
  2. Locate the path where the data is stored. In this case we are interested in adding the data source ‘terr-ecoregions-TNC’, which is in ESRI Shapefile format.

Ingest the shapefile into the GPB

  • We will make use of the ‘LayerMapping’ utility.
  • Use the tool ‘ogrinspect’ provided by the manage.py module inside the folder ‘\apps’, where all the Biospytial source is located. The general syntax of this command is:
    python manage.py ogrinspect [options] <data_source> <model_name>

For this example:

python manage.py ogrinspect /mnt/data1/maps/terr-ecoregions-TNC/tnc_terr_ecoregions.shp TerrEcoregions --srid=4326 --mapping --multi

Where:

  • --srid=4326 sets the SRID for the geographic field.
  • --mapping tells ogrinspect to also generate a mapping dictionary for use with LayerMapping.
  • --multi is specified so that the geographic field is a MultiPolygonField instead of just a PolygonField.

More information in: (https://docs.djangoproject.com/en/2.0/ref/contrib/gis/tutorial/)

The command will print to standard output (the screen) the class definition for this dataset. If we use the ‘--mapping’ option, it will also include a dictionary with a standardized format for the column names.

Export the Shapefile into the Database (Geoprocessing Container)

We will use the ‘LayerMapping’ utility to make this process faster. The first thing is to edit or create the file ‘load_shapefiles.py’ inside the ‘ecoregions’ app.

We will define here the mapping names dictionary (see above) and the necessary code to insert the shapefile into the database.

This is the content of the file ‘load_shapefiles.py’:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
import os
from django.contrib.gis.utils import LayerMapping
from .models import TerrEcoregions
from biospytial import settings

"""Functions for exporting shapefiles into the PostGIS database."""

__author__ = "Juan Escamilla Mólgora"
__copyright__ = "Copyright 2018, JEM"
__license__ = "GPL"
__maintainer__ = "Juan"
__email__ = "molgor@gmail.com"

# Generated by ogrinspect
terrecoregions_mapping = {
    'eco_id_u': 'ECO_ID_U',
    'eco_code': 'ECO_CODE',
    'eco_name': 'ECO_NAME',
    'eco_num': 'ECO_NUM',
    'ecode_name': 'ECODE_NAME',
    'cls_code': 'CLS_CODE',
    'eco_notes': 'ECO_NOTES',
    'wwf_realm': 'WWF_REALM',
    'wwf_realm2': 'WWF_REALM2',
    'wwf_mhtnum': 'WWF_MHTNUM',
    'wwf_mhtnam': 'WWF_MHTNAM',
    'realmmht': 'RealmMHT',
    'er_update': 'ER_UPDATE',
    'er_date_u': 'ER_DATE_U',
    'er_ration': 'ER_RATION',
    'sourcedata': 'SOURCEDATA',
    'geom': 'MULTIPOLYGON',
}

file_shp = os.path.abspath(
    os.path.join(settings.PATH_RAWDATASOURCES,
                 'terr-ecoregions-TNC',
                 'tnc_terr_ecoregions.shp'),
)

def run(verbose=True):
    lm = LayerMapping(TerrEcoregions, file_shp, terrecoregions_mapping,
                      transform=False)
    lm.save(strict=True, verbose=verbose)

To load the layer, log into the Biospytial iPython environment and run:

python manage.py shell

from ecoregions import load_shapefiles
load_shapefiles.run()

Example 2: add Roads Layer

I’ve downloaded the roads in Shapefile format from: http://www.conabio.gob.mx/informacion/gis/maps/geo/carre1mgw.zip

Using the ‘ogrinspect’ tool we obtain the following:

"""
This is an auto-generated Django model module created by ogrinspect.

from django.contrib.gis.db import models

class MexRoads(models.Model): 
    fnode_field = models.BigIntegerField()
    tnode_field = models.BigIntegerField()
    lpoly_field = models.BigIntegerField() 
    rpoly_field = models.BigIntegerField() 
    length = models.FloatField() 
    cov_field = models.BigIntegerField() 
    cov_id = models.BigIntegerField() 
    geom = models.MultiLineStringField(srid=4326)

#Auto-generated LayerMapping dictionary for MexRoads model

mexroads_mapping = { 'fnode_field' : 'FNODE_',
 'tnode_field' : 'TNODE_',
 'lpoly_field' : 'LPOLY_',
 'rpoly_field' : 'RPOLY_',
 'length' : 'LENGTH',
 'cov_field' : 'COV_',
 'cov_id' : 'COV_ID',
 'geom' : 'MULTILINESTRING'
 } """


We need to include this layer in an app. For ease of use, we will load it in the ‘sketches’ app, i.e. sketches/models.py.

Load the Shapefile into the PostGIS database using shp2pgsql (alternative method; it works!)

These were the commands used, inside the PostGIS container:

shp2pgsql -I -s 4326 carre1mgw.shp MexRoads > mex_roads.sql
psql -d biospytial -U biospytial -h localhost -f mex_roads.sql

Adding Raster data

We will use the raster support from PostGIS > 2.x, via the script migrateToPostgis.bash inside /apps/raster_api/bash_raster_tools/bash_scripts.

However, the tools for ingesting data into the database are stored in the Geospatial Processing container, so we need to log into that container and run the script there. You can copy the ‘bash_raster_tools’ folder into the container and run the command ‘migrateToPostgis.bash’.
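A minimal sketch of that copy step, assuming the container is reachable as postgis_container (a hypothetical name; use the actual one shown by docker ps):

docker cp bash_raster_tools postgis_container:/mnt/data1/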

Example:

'/mnt/data1/bash_raster_tools/bash_scripts/migrateToPostgis.bash [RasterData.tif]'

After this, the raster data is stored in the database.

How to add a class definition for a Loaded Raster Layer

We need to explicitly add the Model/Class definition inside the file ‘raster_api/models.py’.

The base class is GenericRaster. We need to extend this class into a new definition.

Follow this example:

class DistanceToRoadMex(GenericRaster):
    """
    Model for the Distance to Road data source.

    Attributes
    ==========
    I'll use the default attributes given by raster2pgsql.
    id : int
        Unique primary key; the identification number of each element
        in the mesh.
    """
    number_bands = 1
    neo_label_name = 'Dist_to_road_mex'
    link_type_name = 'HAS_A_DISTANCE_OF'
    units = '(meters)'

    class Meta:
        managed = False
        db_table = 'dist_map_wgs84_clip'

    def __str__(self):
        # fill the template with the default Django primary key
        return "<Distance to Road Raster Data: %s>" % self.id

Add the new model into the ‘raster_models_dic’ dictionary:

raster_models_dic = {
    'WindSpeed': raster_models[7],
    'Elevation': raster_models[0],
    'Vapor': raster_models[6],
    'MaxTemperature': raster_models[5],
    'MinTemperature': raster_models[4],
    'MeanTemperature': raster_models[3],
    'SolarRadiation': raster_models[2],
    'Precipitation': raster_models[1],
    'WorldPopLatam2010': raster_models[8],
    'DistanceToRoadMex': raster_models[9],
}

The data is ready to be used in Biospytial.

Testing

See the notebook inside the ‘raster_api’ module and run it interactively 🙂