Chapter 5 Estimating the Marginal Likelihood via Importance Sampling Squared (IS2)

In Chapter 3, we outlined two model comparison methods (graphical checks and information criteria) for determining the best fitting models for the Forstmann dataset. However, the current gold standard for model selection and comparison is the marginal likelihood, from which Bayes factors are computed. In this chapter we illustrate how to estimate the marginal likelihood via importance sampling squared (IS2). IS2 is a robust estimation method that accounts for model flexibility and provides unbiased estimates of the marginal likelihood. Marginal likelihood estimates allow you to assess both the fit and the flexibility of a model, because they integrate the likelihood across the prior predictive space of the model. In hierarchical models, obtaining the marginal likelihood is difficult: viewed as a function of the group-level parameters, the likelihood is the density of the data with the random effects integrated out, and this integral is usually unavailable (computationally expensive or analytically intractable). IS2 provides a way to estimate the marginal likelihood in exactly this situation, where the likelihood itself is intractable but can be estimated unbiasedly.

The method works by first sampling from the posterior via a sampling scheme such as MCMC (or here, PMwG). These draws are used to build an importance distribution for the group-level parameters, constructed by fitting a mixture of normal or Student t distributions to the posterior samples. For each proposal drawn from this importance distribution, we then generate conditional proposals for each subject's random effects, called particles, and use these to form an unbiased importance sampling estimate of the proposal's likelihood. Combining the estimated likelihood with the prior and the importance distribution yields the marginal likelihood estimate. Because one importance sampling procedure (over the random effects) is nested inside another (over the group-level parameters), the method is known as importance sampling squared.

5.1 Using IS2 with the Forstmann dataset

Here we demonstrate how to use the IS2 algorithm to compare models from the Forstmann example in Chapter 3. We use samples taken from the two Forstmann et al. (2008) models shown in that chapter, specifically samples from the PMwG posterior sampling stage (the sampled object). IS2 is robust enough to use samples from any stage of PMwG; however, we recommend using posterior-stage samples, as these give lower-variance estimates.

In this example we run IS2 to compare these models in an unbiased manner. The IS2 paper uses the same example, so this chapter provides a walkthrough of that analysis. As we are using samples from the PMwG algorithm, the prior_dist function is specifically coded for the PMwG priors; to run other models, the prior_dist function needs to be updated accordingly (more on this below).

5.1.2 Set up variables

Next we set up the variables used by the algorithm. With PMwG these can remain as specified here. The algorithm needs the number of subjects, the number of random effects, the number of posterior iterations, the number of required IS2 samples, the number of IS2 particles, and the parameter names. Here, for convenience, we use 1000 IS2 samples and 250 particles. Running more samples and particles usually gives more reliable estimates, but at the cost of efficiency; we recommend reading the blog post for more information. We also recommend running the IS2 algorithm several times and combining the resulting IS2 samples to achieve more stable estimates. A sketch of this set-up is shown below.
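The following is a minimal sketch of the set-up, assuming the PMwG sampled object exposes the n_subjects, n_pars and par_names fields; the sample sizes match those discussed above.

# Set-up variables for IS2 (a sketch; names follow the prose above)
n_subjects   <- sampled$n_subjects                      # number of subjects
n_randeffect <- sampled$n_pars                          # random effects per subject
n_iter       <- sum(sampled$samples$stage == "sample")  # posterior iterations kept
IS_samples   <- 1000                                    # number of IS2 samples (proposals)
n_particles  <- 250                                     # particles per proposal, per subject
pars         <- sampled$par_names                       # parameter names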

5.1.3 Store the samples

In this next step, we store the outputs from PMwG for use in the IS2 algorithm. We create X, a matrix containing, for each iteration of posterior sampling, the group-level mean parameters, the unwound (log-Cholesky transformed) covariance elements, and the log a-half values. Along the way we also compute, for each subject, the joint mean (mu_tilde) and covariance (sigma_tilde) of the random effects and group-level parameters across iterations; these are used later to construct conditional proposals.

# Store the random effects from the sampled stage of PMwG object
alpha <- sampled$samples$alpha[,, sampled$samples$stage == "sample"]

# Store theta mu from the sampled stage of PMwG object
theta <- sampled$samples$theta_mu[, sampled$samples$stage == "sample"]

# Store the group-level covariance matrices (theta_sig) from the sampled stage of the PMwG object; these are Cholesky-transformed below
sig <- sampled$samples$theta_sig[,, sampled$samples$stage == "sample"]

# The a-half values are used when calculating the Huang and Wand (2013) prior.
# Each 'a' is a random draw from an inverse gamma that scales the inverse Wishart; the resulting mixture of inverse Wisharts forms the prior on the covariance matrix
a_half <- log(sampled$samples$a_half[, sampled$samples$stage == "sample"])



# unwind: convert a covariance matrix to its unconstrained log-Cholesky vector,
# or (with reverse = TRUE) rebuild the matrix from that vector
unwind <- function(x, reverse = FALSE) {

  if (reverse) {
    n <- sqrt(2 * length(x) + 0.25) - 0.5 ## Dimensions of matrix
    out <- array(0, dim = c(n, n))
    out[lower.tri(out, diag = TRUE)] <- x
    diag(out) <- exp(diag(out))  # undo the log transform on the diagonal
    out <- out %*% t(out)        # rebuild the covariance matrix
  } else {
    y <- t(chol(x))              # lower-triangular Cholesky factor
    diag(y) <- log(diag(y))      # log the diagonal so it is unconstrained
    out <- y[lower.tri(y, diag = TRUE)]
  }
  return(out)
}

# Unwind each posterior covariance matrix into its log-Cholesky vector
pts2.unwound <- apply(sig, 3, unwind)

# Columns per subject: random effects, group-level means and unwound covariance elements
n.params <- nrow(pts2.unwound) + n_randeffect + n_randeffect
all_samples <- array(dim = c(n_subjects, n.params, n_iter))
mu_tilde <- array(dim = c(n_subjects, n.params))
sigma_tilde <- array(dim = c(n_subjects, n.params, n.params))


for (j in 1:n_subjects){
  all_samples[j,,] <- rbind(alpha[,j,], theta[,], pts2.unwound[,])
  # Mean of the random effects, group-level means and unwound sigma
  mu_tilde[j,] <- apply(all_samples[j,,], 1, mean)
  # Covariance matrix over the random effects, group-level means and unwound sigma
  sigma_tilde[j,,] <- cov(t(all_samples[j,,]))
}

# X: one row per posterior iteration, holding the group-level means,
# unwound covariance elements and log a-half values
X <- cbind(t(theta), t(pts2.unwound), t(a_half))

5.1.5 Generate proposal parameters from importance samples

Now that we have our importance distribution, we can generate proposals from it. Here we protect against drawing too few proposals from either component: using pmax and pmin to clamp the mixture weight ensures that, even if an estimated weight is very low, we still sample from both parts of the mixture. A sketch follows.
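The sketch below assumes the importance distribution is a two-component multivariate normal mixture fitted to X with the mixtools package; the clamping bounds of 0.05 and 0.95 are illustrative choices.

library(mixtools)

# Fit a two-component normal mixture to the posterior draws in X
# (the importance distribution)
mix <- mvnormalmixEM(X, k = 2)

# Clamp the first mixture weight away from 0 and 1 so that both
# components contribute proposals even if one estimated weight is tiny
w1 <- pmax(0.05, pmin(0.95, mix$lambda[1]))
n1 <- rbinom(n = 1, size = IS_samples, prob = w1)
n2 <- IS_samples - n1

# Draw the proposal (group-level) parameters from the mixture
prop_theta <- rbind(mvtnorm::rmvnorm(n1, mix$mu[[1]], mix$sigma[[1]]),
                    mvtnorm::rmvnorm(n2, mix$mu[[2]], mix$sigma[[2]]))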

5.1.7 Write a prior distribution function

This function calculates the density of each proposal under the prior used in PMwG: Huang and Wand's (2013) prior on the covariance matrix together with a multivariate normal prior on the group-level means. The final line shows the value that is returned, which is equation x in the paper: the density of the current parameters under the prior on the mean (log_prior_mu), the variance (log_prior_sigma) and the variance of the variance (log_prior_a). Several other calculations are also performed here, which are detailed in the paper. A sketch of the function follows.
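The following is a minimal sketch of prior_dist, assuming PMwG's default priors; the hyperparameters v_half and A_half mirror PMwG's defaults, and the Jacobian terms for the log and Cholesky transforms are omitted for brevity (the full version in the paper includes them).

# A sketch of prior_dist under the PMwG default priors (assumptions noted above)
prior_dist <- function(parameters, prior_parameters = sampled$prior, n_randeffect) {
  n_total <- length(parameters)
  # Split the proposal vector: means, unwound covariance, log a-half values
  theta.mu   <- parameters[1:n_randeffect]
  theta.sig2 <- unwind(parameters[(n_randeffect + 1):(n_total - n_randeffect)],
                       reverse = TRUE)
  a_half     <- exp(parameters[(n_total - n_randeffect + 1):n_total])
  v_half <- 2  # degrees-of-freedom hyperparameter (assumed, as in PMwG)
  A_half <- 1  # scale hyperparameter (assumed, as in PMwG)

  # Density of the group-level means under their multivariate normal prior
  log_prior_mu <- mvtnorm::dmvnorm(theta.mu,
                                   mean  = prior_parameters$theta_mu_mean,
                                   sigma = prior_parameters$theta_mu_var,
                                   log   = TRUE)
  # Density of the covariance matrix: inverse Wishart conditional on a-half
  log_prior_sigma <- log(MCMCpack::diwish(theta.sig2,
                                          v = v_half + n_randeffect - 1,
                                          S = 2 * v_half * diag(1 / a_half)))
  # Density of the a-half values under their inverse gamma prior
  log_prior_a <- sum(invgamma::dinvgamma(a_half, shape = 0.5,
                                         rate = 1 / A_half^2, log = TRUE))
  log_prior_mu + log_prior_sigma + log_prior_a
}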

5.1.8 Write a get_logp function

Next we create the function used to calculate the likelihood of each proposal; this implements Equation 5 in the paper. For each subject, we first generate n_particles particles from a mixture of the group-level distribution and a conditional multivariate normal, conditioned on that subject's posterior mean and covariance (mu_tilde and sigma_tilde). We then evaluate each particle under the model likelihood and under the group-level distribution, which together form the numerator of Equation 5, and under the mixture proposal density, which forms the denominator. Finally, we protect against numerical underflow and overflow by centring the log weights, and return the sum over subjects of the log mean particle weight.
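get_logp relies on a group_dist helper, which either samples random effects from the group-level distribution or evaluates their log density under it. The helper below is a minimal sketch consistent with the calls in get_logp, assuming PMwG's multivariate normal group-level model.

# group_dist: sample from, or evaluate the log density of, the group-level
# multivariate normal implied by one row of prop_theta (a sketch)
group_dist <- function(random_effect = NULL, parameters, sample = FALSE,
                       n_samples = NULL, n_randeffect) {
  theta.mu <- parameters[1:n_randeffect]
  # Rebuild the group-level covariance matrix from its unwound form
  theta.sig2 <- unwind(parameters[(n_randeffect + 1):(length(parameters) - n_randeffect)],
                       reverse = TRUE)
  if (sample) {
    mvtnorm::rmvnorm(n_samples, theta.mu, theta.sig2)
  } else {
    mvtnorm::dmvnorm(random_effect, theta.mu, theta.sig2, log = TRUE)
  }
}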

get_logp <- function(prop_theta,
                     data,
                     n_subjects,
                     n_particles,
                     n_randeffect,
                     mu_tilde,
                     sigma_tilde,
                     i,
                     group_dist = group_dist) {
  # array for the log weights
  logp <- array(dim = c(n_particles, n_subjects))
  # for each subject, draw n_particles importance samples (particles) and find the log weight of each
  for (j in 1:n_subjects){
    # generate particles from a mixture of the conditional multivariate normal
    # (weight wmix) and the group-level distribution (weight 1 - wmix)
    wmix <- 0.95
    n1 <- rbinom(n = 1, 
                 size = n_particles,
                 prob = wmix)
    if (n1 < 2) n1 <- 2
    if (n1 > (n_particles - 2)) n1 <- n_particles - 2 ## These just avoid degenerate arrays.
    n2 <- n_particles - n1
    # conditional MVN for subject j's random effects, given the group-level
    # parameters of proposal i
    conditional <- condMVNorm::condMVN(mean = mu_tilde[j,],
                                       sigma = sigma_tilde[j,,],
                                       dependent.ind = 1:n_randeffect,
                                       given.ind = (n_randeffect + 1):n.params,
                                       X.given = prop_theta[i, 1:(n.params - n_randeffect)])
    particles1 <- mvtnorm::rmvnorm(n1, 
                                   conditional$condMean,
                                   conditional$condVar)
    # remaining particles are drawn from the group-level distribution
    particles2 <- group_dist(n_samples = n2,
                             parameters = prop_theta[i,],
                             sample = TRUE,
                             n_randeffect = n_randeffect)
    particles <- rbind(particles1, particles2)
    
    for (k in 1:n_particles){
      x <- particles[k,]
      # name the particle so the likelihood function can use it
      names(x) <- pars
      # density of subject j's data given particle k, from the model's log
      # likelihood function (first term of the Equation 5 numerator)
      logw_first <- sampled$ll_func(x,
                                    data = data[as.numeric(factor(data$subject)) == j,])
      
      # density of particle k under the group-level distribution
      # (second term of the Equation 5 numerator)
      logw_second <- group_dist(random_effect = particles[k,],
                                parameters = prop_theta[i,],
                                sample = FALSE,
                                n_randeffect = n_randeffect)
      # the denominator of Equation 5: the mixture of the conditional density
      # and the group-level density
      logw_third <- log(wmix * mvtnorm::dmvnorm(particles[k,],
                                                conditional$condMean,
                                                conditional$condVar) +
                          (1 - wmix) * exp(logw_second))
      # Equation 5: log weight of particle k for subject j
      logw <- (logw_first + logw_second) - logw_third
      logp[k,j] <- logw
    }
  }
  # centre the log weights before exponentiating and add the maximum back at
  # the end; this avoids Inf and -Inf values
  sub_max <- apply(logp, 2, max)
  logw <- t(t(logp) - sub_max)
  w <- exp(logw)
  subj_logp <- log(apply(w, 2, mean)) + sub_max  # log mean particle weight per subject
  
  # sum the logp and return 
  return(sum(subj_logp))
}
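With the proposals, prior_dist, group_dist and get_logp in place, the IS2 weights can be computed and combined into a marginal likelihood estimate. The loop below is a sketch, assuming the two-component mixture objects (mix, w1, prop_theta) from section 5.1.5; the final lines compute a numerically stabilised log of the mean weight.

# Top-level IS2 loop (a sketch, not the verbatim chapter code)
logw_is2 <- numeric(IS_samples)
for (i in 1:IS_samples) {
  # unbiased estimate of the log likelihood of proposal i (Equation 5)
  logp_i <- get_logp(prop_theta, sampled$data, n_subjects, n_particles,
                     n_randeffect, mu_tilde, sigma_tilde, i,
                     group_dist = group_dist)
  # log prior density of proposal i
  log_prior <- prior_dist(prop_theta[i,], sampled$prior, n_randeffect)
  # log density of proposal i under the importance mixture
  log_q <- log(w1 * mvtnorm::dmvnorm(prop_theta[i,], mix$mu[[1]], mix$sigma[[1]]) +
                 (1 - w1) * mvtnorm::dmvnorm(prop_theta[i,], mix$mu[[2]], mix$sigma[[2]]))
  logw_is2[i] <- logp_i + log_prior - log_q
}
# log marginal likelihood: stabilised log of the mean IS2 weight
max_w <- max(logw_is2)
log_marglik <- max_w + log(mean(exp(logw_is2 - max_w)))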