Tutorial 3 - Model Comparisons and Robustness

Wataru Toyokawa and Charley Wu

2023-07-01

library(tidyverse)
library(cowplot)
library(cmdstanr)
options(dplyr.summarise.inform = FALSE)
set.seed(1234) #set seed for reproducibility

#Some generic functions
softmax <- function(beta, Qvec){
  p <- exp(beta*Qvec)
  p <- p/sum(p) #normalize to sum to 1
  return(p)
}
softmaxVec <- function(betaVec, Qvec){
  p_exp <- exp(betaVec*Qvec)
  p_sum <- p_exp %>% apply(2, sum)
  p <- p_exp/p_sum #normalize to sum to 1
  return(p)
}
logisticFunc <- function(alphaRaw){
  return(1 / (1 + exp(-alphaRaw)))
}

This is Tutorial 3 of COSMOS Konstanz. The goal of this tutorial is to teach you to perform model comparisons, and to both interpret and test the robustness of your results.

Comparing models

Compared to verbal theories, computational models allow us a much higher degree of precision in describing a hypothesis. Computational models require making commitments to explicit mechanisms of learning and decision-making that are hypothesized to produce the observed behavior. Through model comparison, we can pit different theories against one another in a competition, to see which theory best captures the empirical data. We can then examine the parameters of the best model to provide interpretations that the raw behavioral data alone is unable to provide.

An important consideration in any model comparison is to ensure we have a fair competition. One of the most common problems is overfitting, which we will dive into first.

Overfitting

Imagine we are trying to model the temperature in London as a function of the calendar date (an example from Gigerenzer and Brighton 2009). To describe this relationship, we can consider regression models using polynomials of different degrees. Models using higher-degree polynomials add additional complexity, since there is increased flexibility in the types of relationships that can be learned and a larger number of parameters (e.g., weights) to be estimated.

On the left, we show the fits of two models, using either degree 3 or degree 12 polynomials. Both seem to provide similar trends, although the more complex model has more variability. On the right, we compare fitting vs. prediction error, where we take 30 random data points to fit each model, and then try to predict out-of-sample for the rest of the data. As the complexity of a model increases (represented by increasing the degree of the polynomial), it begins to overfit the data, and to lose predictive ability. Overly complex models, in general, will overfit, and thus have larger predictive error. However, if the model is too simple, it will fail to fit at all, and also have high predictive error.

Thus, there is a trade-off between fitting (describing the data we have access to) and predicting (describing unobserved data). This trade-off is an integral part of how we do model comparison and measure goodness of fit, which we discuss further in the next section.
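To see this fitting vs. prediction trade-off in action, here is a minimal sketch using synthetic data (our own illustration; the London temperature data is not included in the tutorial materials):

#Toy demonstration of fitting vs. prediction error
x <- runif(100, 0, 1) #hypothetical predictor (e.g., date)
y <- sin(2*pi*x) + rnorm(100, sd = .3) #hypothetical outcome with noise
train <- sample(1:100, 30) #fit on 30 random data points, predict the rest
for (degree in c(1L, 3L, 12L)){ #increasing model complexity
  fit <- lm(y ~ poly(x, degree), data = data.frame(x, y)[train,])
  fitError <- mean(residuals(fit)^2) #fitting error on the training data
  predError <- mean((y[-train] - predict(fit, newdata = data.frame(x = x[-train])))^2) #out-of-sample prediction error
  print(sprintf('degree: %d, fitting error: %.2f, prediction error: %.2f', degree, fitError, predError))
}

The degree 12 model will typically achieve the lowest fitting error but a higher prediction error than the degree 3 model.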

Goodness of fit

The essence of most model comparison methods is to define a goodness of fit measure that quantifies how well a model describes the data, while also accounting for overfitting.

Goodness of fit measures

These goodness of fit measures can be characterized as belonging to either a Maximum likelihood or a Bayesian model selection framework (the columns in the figure above). We explain this distinction in more detail below. In practice, however, a modeler more often needs to decide whether they can justify the simplest approach of penalizing for the number of parameters, or whether they need the additional robustness afforded by actively testing prediction error using cross validation, or by using MCMC sampling to invoke Bayesian Occam’s razor, which achieves a natural balance between fit and simplicity.

We will walk through each of these approaches, and start by explaining the theoretical distinction between maximum likelihood and Bayesian methods.

Maximizing likelihood or Bayesian model selection

Different approaches to model fitting and model comparison broadly fall under either a Maximum likelihood estimation (MLE) or a Bayesian model selection framework.

Maximum likelihood quantifies the goodness of fit for a single set of parameter values \(\hat{\theta}\) under which the model provides the best fit to the observed data. The best parameter values \(\hat{\theta}\) (i.e., the MLE), are typically found using an optimization algorithm. Overfitting is then avoided by penalizing for the number of parameters (e.g., AIC) or using cross-validation to actively separate the data into a training set and a test set (more details below).

In contrast, the goal of the Bayesian framework is to describe the evidence for model \(m_1\) against an alternative \(m_2\): \(\frac{P(D|m_1)}{P(D|m_2)}\). This comparison uses the marginal likelihood \(P(D|m) =\int P(D|m,\theta)P(\theta|m)d\theta\), which integrates across all possible parameter combinations. This provides a natural penalty for more complex models, since the fit of the model \(P(D|m)\) is computed across the entire range of flexibility that the parameters allow for. However, computing the marginal likelihood is intractable in most cases, and is usually approximated with the BIC or via MCMC sampling.
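To build intuition for how the marginal likelihood invokes Bayesian Occam’s razor, consider a toy example (our own sketch, unrelated to the tutorial models): a fair-coin model with no free parameters versus a coin model with a free bias parameter \(\theta\).

#Toy illustration of the marginal likelihood
d <- c(1,1,0,1,0,1,1,0,1,1) #hypothetical coin flips: 7 heads out of 10
m1_evidence <- dbinom(sum(d), size = length(d), prob = 0.5) #P(D|m1): fair coin, no free parameters
thetaGrid <- seq(0, 1, length.out = 1000) #grid over the free bias parameter of m2
m2_evidence <- mean(dbinom(sum(d), size = length(d), prob = thetaGrid)) #P(D|m2): average the likelihood over a uniform prior
m1_evidence / m2_evidence #Bayes factor: slightly favors the simpler model, even though m2's best-fitting theta describes the data better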

While maximum likelihood and Bayesian methods have different interpretations, the simplest methods of computing AIC or BIC are quite similar in practice. Both approaches simply compute the MLE and explicitly penalize for complexity based on the number of parameters. We will start first by explaining these approaches.

Penalizing for complexity

AIC and BIC are goodness of fit measures based on computing a maximum likelihood estimate (MLE), although the latter has Bayesian interpretations. Both approaches start by computing the set of parameters that maximize the likelihood of the data, and then add a penalty for complexity under the simple assumption that complexity can be expressed as a function of the number of parameters: the more parameters a model has, the more complex it is. Of course, this doesn’t always hold when comparing models with different structural forms. But it is a reasonable starting point, and this simplicity makes AIC and BIC very common methods.

Maximum likelihood estimation

We already described how to compute an MLE in the previous tutorial, which is the set of parameters \(\hat{\theta}\) that maximizes the probability of the observed data \(P(D|m,\theta)\). We slightly augment our notation from before to clarify that the MLE is with respect to a specific model \(m\), since we will be considering multiple models (e.g., \(m_1, m_2, \ldots, m_k\)).

Let’s start by loading in some simulated data from the Tutorial 2 Q-learning model and treat this as if it were participant data for now.

simDF <- readRDS('data/simChoicesQlearning.Rds') #load data that was generated in Tutorial 2
simDF <- subset(simDF, chosen==1) #only include chosen actions
simDF$action <- factor(simDF$action) #factorize choices into discrete options
roundDF <- subset(simDF, round == 1) #plot one illustrative round

#Plot choices and rewards of agent 1 based on social info
ggplot(subset(roundDF, agent==1), aes(x = trial, y = reward, color = action))+
  geom_point(data = subset(roundDF, agent == 1), aes(x = trial, y = -11, color = action))+ #agent choices
  geom_text(x=3, y = -9, label='Agent Choices', color = 'black')+
  geom_point(data = subset(roundDF, agent != 1), aes(x = trial, y = agent+10, color = action))+ #socially observable choices
  geom_text(x=5, y = 16, label='Socially Observable Choices', color = 'black')+ #reward
  stat_summary(fun = mean, geom='line', color = 'black', linewidth = 1)+ #linewidth replaces the deprecated size aesthetic for lines
  coord_cartesian(ylim=c(-12,16))+
  theme_classic()+
  xlab('Trial') +
  ylab('Reward')+
  scale_color_viridis_d(name = 'option', drop=F)+
  ggtitle('Q-Learning agent')+
  theme(legend.position='right')
Simulated data from a Q-learning agent. The black line shows rewards earned over trials. The colored dots on the bottom indicate the pattern of choices by the agent, while the dots at the top indicate the pattern of socially observable choices made by the other agents.


The plot above shows the data we want to provide to our model, where specifically, we want to describe the sequence of choices made by the agent (dots at the bottom).

In order to describe how the participant/agent makes these choices and learns from the environment, we want to provide the model with the same information that your participants had access to in the task. To describe the mechanisms of individual learning, the model has access to the reward observations for each choice (black line). And since we also want to describe social learning mechanisms, we may additionally provide the model with the socially observable choices made by other agents (dots at the top). Here, this data is generated by a Q-learning model, which simply ignores social information. Yet, other models describe explicit mechanisms where social information influences individual choices (e.g., decision-biasing or value shaping).

We can now compute the MLE for the correct underlying Q-learning model, an incorrect value shaping model, or any model we might want to consider.

MLE using Q-learning model:


#Q learning likelihood function
Qlikelihood <- function(params, data, Q0=0, k=10){ #We assume that prior value estimates Q0 are fixed to 0 and not estimated as part of the model
  names(params) <- c('alpha', 'beta') #name parameter vec
  nLL <- 0 #Initialize negative log likelihood
  rounds <- unique(data$round)
  trials <- max(data$trial)
  for (r in rounds){ #loop through rounds
    Qvec <- rep(Q0,k) #reset Q-values each new round
    for (t in 1:trials){ #loop through trials
      p <- softmax(params['beta'], Qvec) #compute softmax policy
      trueAction <- subset(data,trial==t & round == r)$action
      negativeloglikelihood <- -log(p[trueAction]) #compute negative log likelihood
      nLL <- nLL + negativeloglikelihood #update running count
      Qvec[trueAction] <- Qvec[trueAction] + params['alpha']*(subset(data, trial==t & round == r)$reward - Qvec[trueAction]) #update q-values
    }
  }
  return(nLL)
}

#let's optimize the parameters for the Q-learning model
init <- c(1,1) #initial guesses
lower <- c(0,0) #lower and upper limits
upper <- c(1,Inf)

MLE_q <- optim(par=init, fn = Qlikelihood,lower = lower, upper=upper, method = 'L-BFGS-B', data = subset(simDF,agent == 1 ))

MLE using VS model:

#VS likelihood function
VSlikelihood <- function(params, data, targetAgent=1, Q0=0, k=10, nAgents = 4){ #We assume that prior value estimates Q0 are fixed to 0 and not estimated as part of the model
  names(params) <- c('alpha', 'beta', 'eta') #name parameter vec
  nLL <- 0 #Initialize negative log likelihood
  rounds <- unique(data$round)
  trials <- max(data$trial)
  for (r in rounds){ #loop through rounds
    Qvec <- rep(Q0,k) #reset Q-values each new round
    socialObs <- rep(NA, nAgents) #social observations
    for (t in 1:trials){ #loop through trials
      socialFreq <- table(factor(socialObs[-targetAgent], levels = 1:k)) + .0001 #Frequency of each socially observed action + a very small number
      Qvec <- Qvec+ (params['eta']*socialFreq) #value bonus
      p <- softmax(params['beta'], Qvec) #compute softmax policy
      trueAction <- subset(data,agent == targetAgent & trial==t & round == r)$action
      negativeloglikelihood <- -log(p[trueAction]) #compute negative log likelihood
      nLL <- nLL + negativeloglikelihood #update running count
      Qvec[trueAction] <- Qvec[trueAction] + params['alpha']*(subset(data, agent == targetAgent & trial==t & round == r)$reward - Qvec[trueAction]) #update q-values
      socialObs <- sapply(1:nAgents, function(ag) subset(data, agent == ag & trial==t & round == r)$action) #record each agent's choice; these become the social observations on the next trial
    }
  }
  return(nLL)
}

#let's optimize the parameters for the VS model
init <- c(1,1,1) #initial guesses
lower <- c(0,0, 0) #lower and upper limits
upper <- c(1,Inf, Inf)

MLE_VS <- optim(par=init, fn = VSlikelihood,lower = lower, upper=upper, method = 'L-BFGS-B', data = simDF)

Let’s quickly examine these results. The original parameters of the agent were \(\theta = \{\alpha = .9, \beta = 1\}\). The MLE of the (correct) Q-learning model returned \(\hat{\theta}_{\text{Q-learning}} = \{\hat{\alpha} =0.947, \hat{\beta} = 1.096\}\), while the MLE of the (mismatched) value shaping model returned \(\hat{\theta}_{\text{VS}} = \{\hat{\alpha} =0.947, \hat{\beta} = 1.096, \hat{\eta} = 1.000\}\).

Exercise 3-1. Aside from collecting more data, the accuracy of the MLE can also be improved by using a different optimization algorithm or by running multiple replications from different initialization points. In general, optimization algorithms are not guaranteed to converge on the best set of parameters. Can you modify the optimization procedure to get better MLE estimates?
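One common remedy (a sketch of one possible approach) is a multi-start procedure: run the optimizer from several random initializations and keep the run with the lowest nLL.

#Multi-start optimization for the Q-learning likelihood
nStarts <- 10
bestMLE <- NULL
for (i in 1:nStarts){
  init <- c(runif(1, 0, 1), runif(1, 0, 10)) #random initializations for alpha and beta
  MLE <- optim(par = init, fn = Qlikelihood, lower = c(0,0), upper = c(1,Inf), method = 'L-BFGS-B', data = subset(simDF, agent == 1))
  if (is.null(bestMLE) || MLE$value < bestMLE$value){
    bestMLE <- MLE #keep the best run
  }
}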

We can describe the MLE fits of both models in terms of their negative log likelihood \(nLL = - \log P(D|\hat{\theta})\), where a lower value (i.e., lower log loss) suggests a better fit. \(nLL_\text{Q-learning} = 109.565902\) vs. \(nLL_\text{VS} = 109.568116\). We see only slight differences here. However, comparing only the nLL fails to account for overfitting. Note that a VS model can also mimic a Q-learning model when \(\eta \rightarrow 0\). Thus, we can compute the AIC or BIC by penalizing for the number of parameters.

Akaike Information Criterion (AIC)

AIC provides an estimate of the predictive accuracy of a model, by adding a penalty for the number of parameters \(k\): \[ \begin{equation} \text{AIC} = -2 \log P(D|\hat{\theta} ) + 2k \end{equation} \] The first term is simply twice the nLL (sometimes known as the deviance), while the second term adds an additional loss proportional to the number of parameters. As with nLL, lower values indicate better results, since it is a measure of the error or loss.

The definition of AIC comes from information theory, and describes the amount of information that is shared between some fixed reality and an imperfect model. A more down-to-earth interpretation is that AIC is an approximation of leave-one-out cross validation error, with an exact equivalence for linear regression and mixed-effects regression models in the limit of infinite data (Stone 1977). However, AIC is also commonly applied to models that are not linear regressions, and always in settings with finite data. Thus, AIC can more practically be regarded as one of the more lax goodness of fit measures we describe, which is more prone to prefer overly complex models.

Bayesian Information Criterion (BIC)

BIC is an alternative method for penalizing for complexity. While it has a Bayesian interpretation, in practice, computing BIC looks very similar to AIC:

\[ \begin{equation} \text{BIC} = -2 \log P(D|\hat{\theta}) + k \log(n) \end{equation} \]

\(k\) is the number of parameters as before, but we also multiply this by the log of the number of data points \(n\). Theoretically, BIC provides an approximation of the marginal likelihood under specific assumptions about the prior distribution over parameters \(P(\theta)\). However, this interpretation is not without controversy and the assumptions are hardly ever unpacked (see Ch.11 of Lewandowsky and Farrell 2010 for a discussion). Yet in practice, we can simply consider BIC to be a more strict approach to penalizing for complexity compared to AIC, such that an overfit model is less likely to come out on top.

Let’s now run the model fitting procedure for each of the 4 simulated agents, and compare AIC vs. BIC.

fitDF <- data.frame() #initialize data frame
for (targetAgent in c(1,2,3,4)){
  #Q learning
  MLE_q <- optim(par=c(1,1), fn = Qlikelihood,lower =  c(0,0), upper= c(1,Inf), method = 'L-BFGS-B', data = subset(simDF,agent == targetAgent)) #estimate parameters
  qDF <- data.frame(id=targetAgent, nLL = MLE_q$value, alpha = MLE_q$par[1], beta = MLE_q$par[2], eta = NA, model = 'Q-learning', k=2) #put results into a data frame
  #Value shaping
  MLE_VS <- optim(par= c(1,1,1), fn = VSlikelihood,lower = c(0,0, 0), upper= c(1,Inf, Inf), targetAgent =targetAgent, method = 'L-BFGS-B', data = simDF) #estimate parameters
  vsDF <- data.frame(id=targetAgent, nLL = MLE_VS$value, alpha = MLE_VS$par[1], beta = MLE_VS$par[2], eta = MLE_VS$par[3], model = 'Value Shaping', k=3) #put results into a dataframe
  fitDF <- rbind(fitDF, qDF, vsDF) #combine together
}

#compute AIC and BIC
fitDF$AIC <- (2*fitDF$nLL) + (2* fitDF$k)
fitDF$BIC <- (2*fitDF$nLL) + (fitDF$k * log(nrow(subset(simDF, agent == 1)))) #n = number of observations per agent (identical for every agent here)

fitDF <- fitDF %>% pivot_longer(cols = AIC:BIC, values_to = "fit", names_to = 'metric') #pivot to longer data frame for plotting

#Plot data
ggplot(fitDF, aes(x = model, y = fit, color = factor(id)))+
  theme_classic()+
  geom_line(aes(group=id), color = 'black', alpha = 0.2) + #lines connecting participants
  geom_point()+ #dot per participant
  facet_wrap(~metric) +
  labs(x = 'Model', y = 'Goodness of Fit (lower is better)')+
  scale_color_brewer(palette = 'Dark2',  name = 'agent')+
  theme(legend.position='right', strip.background = element_blank())
Model comparison with 4 simulated Q-learning agents. Each dot is single simulated agent, while the connecting line shows the relative difference in either AIC or BIC between models.


While there is substantial variability between simulated agents, we see that the true generating Q-learning model tends to produce better fits (lower values). This is much more the case when using BIC, which penalizes more heavily for complexity. A sample size of only 4 agents is too small to perform statistical analyses about whether the advantage for the Q-learning model is meaningful. Yet, these preliminary results provide early evidence that the correct Q-learning model can be recovered.
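Purely to illustrate the mechanics of such an analysis (which would only be meaningful with a larger sample), one could run a paired test on the goodness of fit values, for example:

#Sketch: paired comparison of BIC between models
bicWide <- fitDF %>% filter(metric == 'BIC') %>%
  pivot_wider(id_cols = id, names_from = model, values_from = fit)
t.test(bicWide$`Q-learning`, bicWide$`Value Shaping`, paired = TRUE)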

Exercise 3-2. Can you run this model comparison for a larger sample of simulated agents? You can use the code for simulating data from Tutorial 2. How many simulated agents do you need in order to achieve statistical power for determining which is the winning model? How does this differ between AIC and BIC?

Next, let’s explore how cross-validation more directly avoids the perils of overfitting through out-of-sample predictions.

Cross validation

Here, we will introduce leave-one-round-out cross validation. Simply put, we iteratively hold out a single round of the task from the training set, forming a test set (outer loop). We then use the same MLE approach to estimate our parameters on the training set (inner loop). However, rather than evaluating the fit of the model on the same training data, we use the MLE parameters to make out-of-sample predictions on the test set (i.e., the held out round). We then sum the negative log likelihoods of each iteration (i.e., each cross validation slice) and compare models solely on their out-of-sample predictive accuracy. It is not necessary to add explicit penalization for model complexity, since the generalizability of the model is being directly tested on new data.

Inner and outer loops of cross validation

cvDF <- data.frame() #initialize data frame
for (targetAgent in c(1,2,3,4)){
  qDF <- data.frame() #intialize dataframes for each model
  vsDF <- data.frame()
  for (leaveOutRound in 1:10){#Cross validation loop
    #Q-learning
    MLE_q <- optim(par=c(1,1), fn = Qlikelihood,lower =  c(0,0), upper= c(1,Inf), method = 'L-BFGS-B', data = subset(simDF,agent == targetAgent & round != leaveOutRound)) #estimate parameters on training set
    qDF <- rbind(qDF, data.frame(id=targetAgent, leaveOut = leaveOutRound, nLL = Qlikelihood(params = MLE_q$par,  data = subset(simDF,agent == targetAgent & round == leaveOutRound)), alpha = MLE_q$par[1], beta = MLE_q$par[2], eta = NA, model = 'Q-learning')) #Store the predictions of the CV slice and the MLE params
    #Value shaping
    MLE_VS <- optim(par= c(1,1,1), fn = VSlikelihood,lower = c(0,0, 0), upper= c(1,Inf, Inf), targetAgent =targetAgent, method = 'L-BFGS-B', data = subset(simDF, round !=leaveOutRound)) #estimate
    vsDF <- rbind(vsDF, data.frame(id=targetAgent, leaveOut = leaveOutRound, nLL = VSlikelihood(params = MLE_VS$par, targetAgent =targetAgent, data = subset(simDF, round == leaveOutRound)), alpha = MLE_VS$par[1], beta = MLE_VS$par[2], eta =  MLE_VS$par[3], model = 'Value Shaping' )) 
  }
  cvDF <- rbind(cvDF, qDF, vsDF) #combine together
}

#We now need to aggregate our results over each cross validation slice, summing up the nLL to compute the overall nLL. For the parameter estimates, we get a different MLE for each iteration. We can simply compute the mean or median to get a potentially more robust estimate than with a single MLE over all the data.
cvDF <- cvDF %>% group_by(id, model) %>% summarize(cv_nLL = sum(nLL), alpha = mean(alpha), beta = mean(beta), eta = mean(eta))

#Plot data
ggplot(cvDF, aes(x = model, y = cv_nLL, color = factor(id)))+
  theme_classic()+
  geom_line(aes(group=id), color = 'black', alpha = 0.2) + #lines connecting participants
  geom_point()+ #dot per participant
  labs(x = 'Model', y = 'Cross Validation Loss (lower is better)')+
  scale_color_brewer(palette = 'Dark2',  name = 'agent')+
  theme(legend.position='right', strip.background = element_blank())

Here, the results are a bit more ambiguous. Since CV uses fewer data points for each MLE, these results might suggest adding more rounds to the task, or modifying the task parameters to encourage more divergent model predictions. Ultimately, the purpose of using computational models is to provide new interpretations of behavioral data. The validity of these interpretations hinges on the reliability and robustness of our results. The last section of the tutorial will focus on methods for testing robustness, but let’s first provide a quick introduction to one of the most reliable methods of model estimation: MCMC sampling.

Exercise 3-3. How well do the cross-validated parameter estimates match the true generating parameters? How does this compare to the MLEs? Since cross validation provides a different estimate for each slice of the data, how much variability is there between slices? Does a mean or a median provide a better aggregate statistic?

Bayesian model estimation with MCMC sampling

We provide only a brief overview of Bayesian methods here, but resources such as Lewandowsky and Farrell (2010) and McElreath (2020) provide excellent sources for more in-depth study.

Motivations

To begin, let’s address the main motivations behind using Bayesian model estimation.

First of all, Bayesian methods provide a probability distribution over parameters rather than a single point estimate (the MLE). This provides a more accurate picture of which parameters best describe human behavior, and is more robust to issues such as the presence of outliers and bimodality in the data. Secondly, we are able to model hierarchical relationships, allowing us to infer the characteristics of a general population from a sample of participants, each of whom may have individual differences in how they learn and make decisions. We can also hierarchically model within-subject manipulations, to describe how experimental manipulations change individual parameter estimates, even if there is a lot of variability in individual baselines. Lastly, the Bayesian approach also provides a natural solution to overfitting via Bayesian Occam’s Razor: more complex models have increased flexibility, and this flexibility is penalized during Bayesian inference, since we evaluate the model across the entire range of parameter values.

Yet, for all these advantages, exact Bayesian inference is typically computationally intractable. This motivates the use of Markov Chain Monte Carlo (MCMC) sampling, which approximates an analytically intractable probability distribution using a collection of sampled data points.

MCMC sampling

MCMC combines two basic concepts:

  1. A Markov Chain is a sequential process, where each random sample is used as a stepping stone to generate the next sample. By choosing the appropriate transition dynamics, the Markov Chain will gradually match the target distribution we are trying to approximate as it reaches equilibrium.
  2. Monte Carlo sampling is a foundational concept in statistics named after a famous casino in Monaco. The basic idea is that a large enough number of randomly drawn samples will approximate the underlying distribution. For instance, flipping a coin enough times will provide an approximation of the underlying probability of heads vs. tails.

These two processes are combined to perform Bayesian inference over the entire posterior distribution of the parameters:

\[ P(\theta|D,m) \propto P(D|\theta,m) P(\theta|m) \] Previously, we only used a maximum likelihood point estimate of the parameters \(\hat{\theta}\). Here, we are interested in estimating a full distribution over parameters \(P(\theta|D,m)\). This is sometimes simply known as the posterior, since it is a function of the prior distribution \(P(\theta|m)\) (i.e., our initial guess about the parameters before we see the data) and the likelihood of our observed data \(P(D|\theta,m)\) (i.e., how well a given model and set of parameters predicts the data).

In the illustration below, we show the Metropolis-Hastings algorithm for performing MCMC, which is one of the simplest implementations. We first sample an initial set of parameters from the prior distribution \(\theta^0\sim P(\theta)\). Then, we iteratively sample new parameters from a proposal distribution \(\theta^i \sim P(\theta^i|\theta^{i-1})\). This step can be as simple as adding some Gaussian noise to the previous sample \(\theta^i \sim \mathcal{N}(\theta^{i-1}, \sigma)\). Each new sample is considered as a candidate, which we accept with a probability proportional to how much better it accounts for the data (i.e., whether the likelihood of the new sample \(P(D|\theta^{i},m)\) is higher than that of the old sample \(P(D|\theta^{i-1},m)\)). Thus, we are more likely to accept new samples that have a higher likelihood (filled circles) and less likely to accept samples with lower likelihood (empty circles). We omit some details for simplicity, but in the end, we have a collection of samples (dots) that approximates the posterior distribution (histogram).

An overview of MCMC sampling, adapted from Lee, Sung, and Choi (2015)
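To demystify the algorithm, here is a minimal Metropolis-Hastings sketch for a toy one-parameter problem (estimating a Gaussian mean; our own illustration, not one of the tutorial models):

#Minimal Metropolis-Hastings sampler
d <- rnorm(50, mean = 1) #hypothetical observations
logPosterior <- function(theta){sum(dnorm(d, mean = theta, sd = 1, log = TRUE)) + dnorm(theta, 0, 1, log = TRUE)} #likelihood x standard normal prior (on the log scale)
nSamples <- 5000
samples <- numeric(nSamples)
theta <- rnorm(1) #initial sample from the prior
for (i in 1:nSamples){
  proposal <- rnorm(1, mean = theta, sd = .5) #candidate: Gaussian noise around the previous sample
  acceptProb <- min(1, exp(logPosterior(proposal) - logPosterior(theta))) #always accept better candidates; accept worse ones proportional to the likelihood ratio
  if (runif(1) < acceptProb){ theta <- proposal }
  samples[i] <- theta
}
hist(samples[-(1:1000)]) #discard burn-in; the remaining samples approximate the posterior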

Introduction to STAN

To give you a concrete example of how to build and fit an MCMC model, we use STAN to describe a Q-learning model, using cmdstanr as an R interface to fit it.

The code block below describes our Q learning model, where we are fitting group-level estimates about the \(\alpha\) and \(\beta\) parameters. For more detailed tutorials on STAN, please consider the following resources: https://mc-stan.org/users/documentation/tutorials.

// Q_learning model
data {
  int<lower = 0> All;  // number of observations
  int<lower = 0> Nsub;  // number of subjects
  int<lower = 0> Ncue;  // number of options
  int<lower = 0> Ntrial;  // number of trials per subject
  int<lower = 0> Nround;  // number of rounds per subject
  array[All] int<lower = 0> sub; // subject index
  array[All] int<lower = 0> Y; // index of chosen option: 0 => missing (no choice made)
  array[All] int<lower = 0> trial; // trial number
  array[All] int<lower = 0> thisRound; // round number
  array[All] real payoff; // payoff
}

parameters {
  real mu_alpha;
  real mu_beta;
  real<lower=0> s_alpha;
  real<lower=0> s_beta;
  
  // Varying effects clustered on individual, defined as deviations from grand mean
  // We do the non-centered version with Cholesky factorization
  matrix[2,Nsub] z_ID;               //Matrix of uncorrelated z - values
  vector<lower=0>[2] sigma_ID;       //SD of parameters among individuals
  cholesky_factor_corr[2] L_Rho_ID;    // This is the Cholesky factor: if you multiply this matrix by its transpose you get the correlation matrix

}

transformed parameters {
  matrix[Nsub,2] v_ID; // Matrix of varying effects for each individual for alpha, beta
  array[Nsub, Nround] matrix[Ncue, Ntrial] Q;  // Q-values for each target
  array[Nsub, Nround] matrix[Ncue, Ntrial] q; // softmax choice (log-)probability transformed from Q values
  vector[Nsub] logit_alpha;  // learning rate -- logit scale
  vector[Nsub] log_beta;  // softmax parameter -- log scale
  vector<lower = 0, upper = 1>[Nsub] alpha;  // learning rate
  vector<lower = 0>[Nsub] beta;  // inverse temperature (softmax)
  
  v_ID = ( diag_pre_multiply( sigma_ID , L_Rho_ID ) * z_ID )'; // the ' on the end of this line is a transposing function
  for(i in 1:Nsub) {
    logit_alpha[i] = mu_alpha + s_alpha * v_ID[i, 1];
    log_beta[i] = mu_beta + s_beta * v_ID[i, 2];
    beta[i] = exp(log_beta[i]);
    alpha[i] = 1/(1+exp(-logit_alpha[i]));
  }

  for(idx in 1:All) {
    // SETTING INITIAL Q-VALUE
    if(trial[idx] == 1)
      Q[sub[idx], thisRound[idx]][1:Ncue, trial[idx]] = rep_vector(0, Ncue);
    // CHOICE PROBABILITY
    q[sub[idx], thisRound[idx]][1:Ncue, trial[idx]] = 
      log_softmax( 
        Q[sub[idx], thisRound[idx]][, trial[idx]] * beta[sub[idx]] 
        );

    // Q-VALUE UPDATE
    if(trial[idx] < Ntrial) {
      Q[sub[idx], thisRound[idx]][, (trial[idx]+1)] = Q[sub[idx], thisRound[idx]][, trial[idx]]; // copy Q(t) to Q(t+1)
      // update Q on the chosen option
      if(Y[idx]>0) { // if Y == 0, it means no choice was made (missed trial)
        // Updating chosen option's Q-value
        Q[sub[idx], thisRound[idx]][Y[idx], (trial[idx]+1)] =
              (1-alpha[sub[idx]])*Q[sub[idx], thisRound[idx]][Y[idx], trial[idx]] + alpha[sub[idx]]*payoff[idx];
      }
    }
   
  }
}

model {
  // position and scales of the learning parameters
  mu_alpha ~ std_normal(); // prior
  mu_beta ~ normal(0, 5); // prior
  s_alpha ~ exponential(1); //std_normal(); // prior
  s_beta ~ exponential(1); //std_normal(); // prior
  
  // Varying effects
  to_vector(z_ID) ~ std_normal();
  sigma_ID ~ exponential(1);
  L_Rho_ID ~ lkj_corr_cholesky(2); // A weakly informative prior that accounts for the possibility that alpha and beta are correlated

  for(idx in 1:All) {
    if( Y[idx] > 0 ) {
      target += categorical_lpmf( Y[idx] | exp(q[sub[idx], thisRound[idx]][,trial[idx]]) );
    }
  }
}

generated quantities {
  vector[Nsub] log_lik;
  vector[All] reward_pred_err;
  matrix[2,2] Rho_ID; // the correlation matrix
  
  Rho_ID = multiply_lower_tri_self_transpose(L_Rho_ID);
  
  log_lik = rep_vector(0, Nsub); // initial values for log_lik
  reward_pred_err = rep_vector(-100, All); // initial values for reward_pred_err
  for(idx in 1:All) {
    if( Y[idx] > 0 ) {
      reward_pred_err[idx] = payoff[idx] - Q[sub[idx], thisRound[idx]][Y[idx], trial[idx]];
      log_lik[sub[idx]] = log_lik[sub[idx]] + q[sub[idx], thisRound[idx]][Y[idx],trial[idx]];
    }
  }
}

Exercise 3-4. What other priors could you choose for \(\alpha\) and \(\beta\)? How could you modify this model to integrate decision-biasing and/or value shaping? What priors could you use for \(\gamma\) and \(\eta\)? Note, that you may need to transform these parameters into an unbounded range. Which transformation is appropriate? Hint: \(\gamma\) has the same bounds as \(\alpha\).
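As a hint for the transformations, here is a small sketch in R (the STAN code above applies the same logic to \(\alpha\) and \(\beta\); gammaRaw and etaRaw are hypothetical unconstrained variables):

gammaRaw <- rnorm(1) #hypothetical sample on an unbounded scale
gamma <- 1 / (1 + exp(-gammaRaw)) #inverse-logit maps to [0,1]: appropriate for gamma, which has the same bounds as alpha
etaRaw <- rnorm(1) #hypothetical sample on an unbounded scale
eta <- exp(etaRaw) #exponentiation maps to (0, Inf): appropriate for eta, like beta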

Now let’s run the MCMC estimation on the data.

simulated_data <- readRDS('data/simChoicesQlearning.Rds') #load the data generated in Tutorial 2
simulated_data_for_fit <- simulated_data %>% dplyr::filter(chosen == 1)

# Basic Q-learning model
stanmodel_Q_learning <- cmdstan_model('codeSnippets/Q_learning.stan') #compile model

#MCMC parameters, Note that we run a small number of samples since this is just an illustrative example
chains <- 4
parallel_chains <- 4
thin <- 1
iter_warmup <- 500 
iter_sampling <- 1000  

# preparing data fed to Stan
sim_dt <- list(
  All = nrow(simulated_data_for_fit)
  , Nsub = length(table(simulated_data_for_fit$agent)) # number of agents
  , Nround = length(table(simulated_data_for_fit$round)) # number of rounds
  , Ncue = 10 # number of options
  , Ntrial = 25 # number of trials
  , Ncon = 1 # number of global distributions 
  , sub = simulated_data_for_fit$agent
  , thisRound = simulated_data_for_fit$round
  , Y = simulated_data_for_fit$action # Y = 1, 2, 3, ,,, Ncue
  , trial = simulated_data_for_fit$trial
  , payoff = simulated_data_for_fit$reward # reward
)

# -- MCMC sampling --
fit <- stanmodel_Q_learning$sample(data = sim_dt, seed = 777, chains = chains, parallel_chains = parallel_chains, thin = thin, iter_warmup = iter_warmup, iter_sampling = iter_sampling)
fit$save_object(file = 'data/stanFit.stanmodel') #save to data 

Now let’s take a look at the results. Note that since both \(\alpha\) and \(\beta\) are bounded variables, the STAN code first transformed them into an unbounded range. We first need to transform them back using the inverse transformations (the inverse-logit transform for \(\alpha\) and exponentiation for \(\beta\)) in order to examine them.

fit <- readRDS('data/stanFit.stanmodel') #read object from data, Note that this was removed from git because the model file is too large
#Extract samples
draws_df <- fit$draws(variables = c('mu_alpha', 'mu_beta'), format = "df") #extract the MCMC samples of the group-level alpha and beta estimates, turning it into a dataframe
#Transform parameters back to their original values
draws_df$alpha <- logisticFunc(draws_df$mu_alpha) #inverse-logit (logistic) transform back to [0,1]
draws_df$beta <- exp(draws_df$mu_beta) #exponentiate back to the positive reals

#Pivot to a longer dataframe for plotting
plottingDF <- draws_df %>% pivot_longer(cols = c('alpha', 'beta'), names_to = 'parameter', values_to='estimate')

trueParams <- data.frame(parameter = c('alpha', 'beta'), estimate = c(.9,1))

ggplot(plottingDF, aes(x = estimate,  fill = parameter))+
  geom_histogram(aes(y = after_stat(density)), color = 'black', bins = 30)+ #after_stat(density) replaces the deprecated ..density.. notation
  geom_vline(data = trueParams, aes(xintercept = estimate), color = 'black', linetype = 'dashed')+
  facet_grid(~parameter)+
  theme_classic()+
  xlab('Posterior Estimate')+
  ylab('Density')+
  scale_fill_brewer(palette = 'Dark2')+
  theme(legend.position = 'none', strip.background = element_blank())

The dashed lines indicate the true parameter values that generated the data. So it seems like our MCMC model did a good job of recovering the true parameter values.

Bayesian model comparison

For actually performing Bayesian model comparison, modern methods may seem quite daunting and can quickly become overwhelming in their complexity. But in general, the intuitions are quite simple: we use a form of cross-validation and out-of-sample prediction to test goodness of fit. But rather than using the MLE as before, we use the full posterior distribution over parameters.

We include the code block below to give you an idea of how to access goodness of fit measures for a Bayesian STAN model, which are already computed when fitting the model. One of the easiest outputs to interpret is looic, which, like AIC or BIC, describes goodness of fit with lower values being better. However, we recommend referring to Vehtari, Gelman, and Gabry (2017) and McElreath (2020) for more details about how to interpret these quantities.

fit$loo()
## 
## Computed from 4000 by 4 log-likelihood matrix
## 
##          Estimate   SE
## elpd_loo   -385.0 31.4
## p_loo         1.5  0.3
## looic       770.0 62.8
## ------
## Monte Carlo SE of elpd_loo is NA.
## 
## Pareto k diagnostic values:
##                          Count Pct.    Min. n_eff
## (-Inf, 0.5]   (good)     0      0.0%   <NA>      
##  (0.5, 0.7]   (ok)       3     75.0%   481       
##    (0.7, 1]   (bad)      1     25.0%   120       
##    (1, Inf)   (very bad) 0      0.0%   <NA>      
## See help('pareto-k-diagnostic') for details.
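To compare two fitted models, the loo package (which computes these quantities under the hood) provides loo_compare, which ranks models by their expected out-of-sample predictive accuracy (elpd). A minimal sketch, assuming a second model has been fit and stored as fit_VS (hypothetical here):

loo_Q <- fit$loo() #PSIS-LOO estimate for the Q-learning model
loo_VS <- fit_VS$loo() #hypothetical: the same for a fitted value shaping model
loo::loo_compare(loo_Q, loo_VS) #the best-predicting model is listed first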

While the Bayesian framework is the most robust, it is also the most complex approach to computational modeling. As we’ve seen above, the most complex method isn’t always the best one. Rather, we recommend that you use whichever method is right for the job. Ultimately, the correct choice of methods is whatever you can justify to yourself and the scientific community.

Exercise 3-5. Fit an alternative model to the data (e.g., value shaping) and perform a model comparison. How well can Bayesian model selection recover the true model compared to AIC, BIC, or cross validation?

Model fitting exercise

As a hands-on tutorial, we’ve prepared two simulated datasets of choices. You can download them yourself using these links:

meteorDF <- read_csv('modelFittingData/meteor.csv') 
cometDF <- read_csv('modelFittingData/comet.csv') 

knitr::kable(
  list(head(meteorDF, n=10), head(cometDF, n=10)),
  caption = 'Simulated datasets',
  booktabs = TRUE, valign = 't'
)
Simulated datasets

meteorDF (first 10 rows):

…1   trial   round   agent   reward      choice
1    1       1       1        1.394896   6
2    1       1       2        6.316758   8
3    1       1       3        8.294503   9
4    1       1       4        2.370241   6
5    2       1       1       10.749758   10
6    2       1       2        4.841205   8
7    2       1       3        1.433119   6
8    2       1       4        6.110761   8
9    3       1       1       10.770239   10
10   3       1       2        5.858215   8

cometDF (first 10 rows):

…1   trial   round   agent   reward      choice
1    1       1       1       -8.802858   1
2    1       1       2        7.816797   9
3    1       1       3        9.492214   9
4    1       1       4       -5.848119   3
5    2       1       1        9.153988   9
6    2       1       2       -7.123128   2
7    2       1       3        8.075340   9
8    2       1       4        8.093050   9
9    3       1       1        1.326370   6
10   3       1       2        7.767255   9

Below, we provide code for running model fitting on your own.

#Here is all of the code you will need to run model fitting


#1. Load the data
meteorDF <- read_csv('meteor.csv')   #Make sure the file path is correct, based on where you downloaded the datasets
cometDF <- read_csv('comet.csv') 

k <- length(unique(c(cometDF$choice,meteorDF$choice))) #number of arms; which we can extract from the choices in the data

#softmax function
softmax <- function(beta, Qvec){
  p <- exp(beta*Qvec)
  p <- p/sum(p) #normalize to sum to 1
  return(p)
}

#2. Likelihood functions for different models

#Asocial Q-learning 
asocialLikelihood <- function(params, data, Q0=0){ #We assume that prior value estimates Q0 are fixed to 0 and not estimated as part of the model
  names(params) <- c('alpha', 'beta') #name parameter vec
  nLL <- 0 #Initialize negative log likelihood
  rounds <- max(data$round)
  trials <- max(data$trial)
  for (r in 1:rounds){ #loop through rounds
    Qvec <- rep(Q0,k) #reset Q-values each new round
    for (t in 1:trials){ #loop through trials
      p <- softmax(params['beta'], Qvec) #compute softmax policy
      trueAction <- subset(data, trial==t & round == r)$choice
      negativeloglikelihood <- -log(p[trueAction]) #compute negative log likelihood
      nLL <- nLL + negativeloglikelihood #update running count
      Qvec[trueAction] <- Qvec[trueAction] + params['alpha']*(subset(data, trial==t & round == r)$reward - Qvec[trueAction]) #update q-values
    }
  }
  return(nLL)
}


#Decision-biasing 
dbLikelihood <- function(params, data, targetAgent, Q0=0){ #Note that we add targetAgent as an additional argument, since we need to pass through the entire social data
  names(params) <- c('alpha', 'beta', 'gamma', 'theta') #name parameter vec
  nAgents <- length(unique(data$agent)) #how many agents are in the sim?
  nLL <- 0 #Initialize negative log likelihood
  rounds <- max(data$round)
  trials <- max(data$trial)
  for (r in 1:rounds){ #loop through rounds
    Qvec <- rep(Q0,k) #reset Q-values each new round
    socialObs <- rep(NA, nAgents) #social observations
    for (t in 1:trials){ #loop through trials
      for (ag in 1:nAgents){
        if (ag == targetAgent){ #compute likelihood for target agent
          p_ind <- softmax(params['beta'], Qvec) #individual learning policy
          if (t==1){ #first trial has no social information
            p_fdc = rep(1/k, k) #uniform probabilities
          }else{
            socialFreq <- table(factor(socialObs[-ag], levels = 1:k)) + .0001 #Frequency of each socially observed action + a very small number to avoid divide by zero
            socialFreq <- socialFreq^params['theta'] #add conformity
            p_fdc <- socialFreq/sum(socialFreq) #compute probabilities
          }
           p_db <- ((1-params['gamma']) * p_ind) + (params['gamma'] * p_fdc) #mixture policy
           trueAction <- subset(data, trial==t & round == r & agent==targetAgent)$choice
           negativeloglikelihood <- -log(p_db[trueAction]) #compute negative log likelihood
           nLL <- nLL + negativeloglikelihood #update running count
           Qvec[trueAction] <- Qvec[trueAction] + params['alpha']*(subset(data, trial==t & round == r & agent==targetAgent)$reward - Qvec[trueAction]) #update q-values
        }else{ #not target agent
          #record choices and add to socialObs
          socialObs[ag] <- subset(data, trial==t & round==r & agent==ag)$choice
        }
      }
    }
  }
  return(nLL)
}


#Value shaping
vsLikelihood <- function(params, data, targetAgent, Q0=0){ #Note that we add targetAgent as an additional argument, since we need to pass through the entire social data
  names(params) <- c('alpha', 'beta', 'eta') #name parameter vec
  nAgents <- length(unique(data$agent)) #how many agents are in the sim?
  nLL <- 0 #Initialize negative log likelihood
  rounds <- max(data$round)
  trials <- max(data$trial)
  for (r in 1:rounds){ #loop through rounds
    Qvec <- rep(Q0,k) #reset Q-values each new round
    socialObs <- rep(NA, nAgents) #social observations
    for (t in 1:trials){ #loop through trials
      for (ag in 1:nAgents){
        if (ag == targetAgent){ #compute likelihood for target agent
          socialFreq <- table(factor(socialObs[-ag], levels = 1:k)) + .0001 #Frequency of each socially observed action + a very small number to avoid divide by zero
          Qvec<- Qvec + (params['eta']*socialFreq) 
          p <- softmax(params['beta'], Qvec) 
          trueAction <- subset(data, trial==t & round == r & agent==targetAgent)$choice
          negativeloglikelihood <- -log(p[trueAction]) #compute negative log likelihood
          nLL <- nLL + negativeloglikelihood #update running count
          Qvec[trueAction] <- Qvec[trueAction] + params['alpha']*(subset(data, trial==t & round == r & agent==targetAgent)$reward - Qvec[trueAction]) #update q-values
        }else{ #not target agent
          #record choices and add to socialObs
          socialObs[ag] <- subset(data, trial==t & round==r & agent==ag)$choice
        }
      }
    }
  }
  return(nLL)
}

#3. Fit the model to each participant, by minimizing the nLL

# Here we show examples using the asocial learner, decision-biasing, and value-shaping, fit to agent 1 on the meteor dataset. You will need to write your own code to fit each model to each participant in each dataset

#Asocial learner
init <- c(1,1) #initial guesses for alpha and beta
lower <- c(0,-Inf) #lower and upper limits. We use very liberal bounds here, but you may want to set stricter bounds for better results
upper <- c(1,Inf)

MLE <- optim(par=init, fn = asocialLikelihood,lower = lower, upper=upper, method = 'L-BFGS-B', data = subset(meteorDF, agent==1))
MLE$value #nLL of the MLE
MLE$par #Parameter estimates of the MLE

#DecisionBiasing
init <- c(1,1,1,1) #initial guesses for alpha, beta, gamma, and theta
lower <- c(0,-Inf,0, 1) #lower and upper limits. We use very liberal bounds here, but you may want to set stricter bounds for better results
upper <- c(1,Inf,1, Inf)

MLE <- optim(par=init, fn = dbLikelihood,lower = lower, upper=upper, method = 'L-BFGS-B', data = meteorDF , targetAgent=1)
MLE$value #nLL of the MLE
MLE$par #Parameter estimates of the MLE

#value shaping
init <- c(1,1,1) #initial guesses for alpha, beta, and eta
lower <- c(0,-Inf,0) #lower and upper limits. We use very liberal bounds here, but you may want to set stricter bounds for better results
upper <- c(1,Inf,Inf)

MLE <- optim(par=init, fn = vsLikelihood,lower = lower, upper=upper, method = 'L-BFGS-B', data = meteorDF, targetAgent=1)
MLE$value #nLL of the MLE
MLE$par #Parameter estimates of the MLE

#4. Compare models

#Here we provide functions for computing the AIC and BIC, but you could also modify the optimization function above to perform cross validation

AIC <- function(MLE){
  aic <- (2*MLE$value) + (2* length(MLE$par))
  return(aic)
}

AIC(MLE)
#nObs can be defined as nrow(subset(data, agent == targetAgent))
BIC <- function(MLE, nObs){
  bic <- (2*MLE$value) + (length(MLE$par) * log(nObs))
  return(bic)
}

BIC(MLE, nrow(subset(meteorDF, agent == 1))) #use the number of observations from the same dataset the model was fit to

Robustness

Having determined which model performs best from our initial set of hypotheses, we now need to make sure that it robustly captures what we designed it to capture. Only then, can we interpret the parameters and discover new insights beyond the behavioral data.

We start by describing three tests for robustness, which diagnose potential flaws with our modeling framework.

  1. Model recovery. Can the data actually differentiate between the models we are considering? Could there be model mimicry, where the wrong model can mistakenly win?

  2. Parameter recovery. Are the parameters of the model capturing distinct phenomena? Can changes in one parameter be accommodated by changes in another parameter (i.e., misspecification)?

  3. Simulated data. Can the model generate realistic participant behavior? Is it capturing the mechanisms that matter for performance, rather than simply fitting the noise?

Model Recovery

The basic principle of model recovery is that we want to simulate data with a given model and set of parameters, and test whether our model estimation and model comparison framework is capable of recovering the true model.

Model recovery procedure

Once we have performed these simulations, there are two different ways we can interpret the results.

The first is to compute a confusion matrix to describe the probability that model \(i\) is the best fit given data from model \(j\). We simply refer to this as \(P(\text{fit}|\text{sim})\) for every pairwise combination of models, visualized in the plot on the left. In the plot, the rows sum to 1 and tell us, for any given model used to simulate data (each row), which models best describe the data. The diagonals of the plot correspond to the correct correspondence where \(\text{fit}=\text{sim}\). In an ideal world, this would always be equal to 1. But as we can see for model 5, it often gets mimicked by models 3 and 4, suggesting potential issues with the modeling framework.

The second method we can use to interpret model recovery is via the inversion matrix. This requires using Bayes’ rule to compute the probability that if a model is the best fit, it was also the true model that simulated the data: \(P(\text{sim}|\text{fit}) = \frac{P(\text{fit}|\text{sim})P(\text{sim})}{\sum_{\text{sim}}P(\text{fit}|\text{sim})P(\text{sim})}\). This is visualized in the plot on the right, where now, the columns sum to 1. Moving along each column, we can read out the probability that, if we found a given model to be the best fit for our participant data, it was also the true model that generated the data. Here we see high confidence for model 5. Since it is often mimicked and beaten out by models 3 and 4, if it actually wins, then it is quite likely to be the true model. This is less true for model 1: even though it had \(P(\text{fit}|\text{sim})=1\), it also mimicked models 3 and 4. Thus, the evidence for model 1 is diluted due to this mimicry.

Interpreting model recovery. Adapted from Wilson and Collins (2019)

As you can see, both analyses play an important role in understanding how we can interpret our modeling results. It is often best to run these analyses before even acquiring any data. We can simply use some best guesses about plausible parameters to simulate data. Issues with model recovery at this stage may suggest tweaks to the experiment design or changes in how the task or models are specified. Figuring out issues with model recoverability prior to running an experiment can save you a lot of time and resources.
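To make these computations concrete, here is a minimal sketch, assuming a hypothetical data frame recoveryDF with one row per simulation, a column sim (the generating model), and a column fit (the winning model):

#Confusion and inversion matrices from model recovery results
confusionMatrix <- prop.table(table(recoveryDF$sim, recoveryDF$fit), margin = 1) #P(fit|sim): each row sums to 1
inversionMatrix <- sweep(confusionMatrix, 2, colSums(confusionMatrix), FUN = '/') #P(sim|fit) via Bayes' rule with a uniform P(sim): each column sums to 1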

Exercise 3-6. Another method to improve how well you can differentiate between competing models is to design more diagnostic experiments. One framework is to use Optimal Experimental Design (Myung and Pitt 2009) to develop experiments that maximally differentiate between different hypotheses. One simple method you can try out is to select two different models, and then simulate performance across a wide range of task designs (e.g., number of options, reward distributions, number of agents, etc…). Which set of task parameters provides the best recovery of the true model?

Parameter recovery

Next, it is important to check whether the fitting procedure captures distinct and behaviorally specific parameter estimates. This procedure is known as Parameter Recovery, where the goal is to assess whether estimated parameters correspond to the true parameters that were used to simulate the data.

Parameter recovery. Adapted from Wilson and Collins (2019)

Using either participant parameter estimates or some prior guess about a plausible range of parameters, we simulate data from our models. Then, we run our model estimation procedure on this simulated data and examine the new parameter estimates we acquire. The goal of parameter recovery is to see whether the fit parameters correspond to the simulated parameters. If not, the parameters may be misspecified. This may be due to a variety of issues. For instance, two parameters may be mutually capturing a similar mechanism, such that changes in one parameter can be compensated for by changes in the other. There may also be a lack of sensitivity in the overall framework, where changes to a model parameter fail to manifest in clear behavioral differences.

Since one of the main goals of computational modeling is to be able to interpret our model parameters, running parameter recovery is an important step in ensuring these parameters correspond to some meaningful characteristic of behavior.

We provide an example below, where we first simulate new data for 20 Q-learning agents with different parameter values (rather than the same parameters as before).

#Task parameters for bandit with Gaussian rewards
k <- 10 #number of arms
meanVec <- seq(-10,10, length.out=k) #Payoffs means
sigmaVec <- rep(1, k) #Payoff stdevs
trials <- 25 #total number of trials
nAgents <- 20
rounds <- 10

banditGenerator <- function(a) {#a is an integer or a vector of integers, selecting one of the 1:k arms of the bandit
  payoff <- rnorm(n = length(a), mean = meanVec[a], sd = sigmaVec[a])
  return (payoff)
}

#Model parameters
alpha <- runif(nAgents, 0,1) #sample from a uniform prior
beta <- rlnorm(nAgents) #sample from a log-normal prior
Q0 <- Qvec <-  rep(0,k) #prior initialization of Q-values

#Now simulate data for multiple agents over multiple rounds
simDF <- data.frame()
for (a in 1:nAgents){ #loop through agents
  for (r in 1:rounds){ #loop through rounds
    Qvec <- Q0 #reset Q-values
    for (t in 1:trials){ #loop through trials
      p <- softmax(beta[a], Qvec) #compute softmax policy
      action <- sample(1:k,size = 1, prob=p) #sample action
      reward <- banditGenerator(action) #generate reward
      Qvec[action] <- Qvec[action] + alpha[a]*(reward - Qvec[action]) #update q-values
      chosen <- rep(0, k) #create an index for the chosen option
      chosen[action]<- 1 #1 = chosen, 0 = not
      trialDF <- data.frame(trial = t, agent = a, round = r, Q = Qvec, action = 1:k, chosen = chosen, reward = reward, true_alpha = alpha[a], true_beta = beta[a])
      simDF <- rbind(simDF,trialDF)
    }
  }
}
simDF <- simDF %>% filter(chosen==1)
saveRDS(simDF, 'data/simChoicesQlearning20.Rds')

Now let’s use a simple MLE to estimate the parameters and see how well we recover the true parameters.

simDF <- readRDS('data/simChoicesQlearning20.Rds')
simDF$estimated_alpha = rep(NA, rounds * trials * nAgents)
simDF$estimated_beta = rep(NA, rounds * trials * nAgents)

#Now let's optimize the parameters
init <- c(1,1) #initial guesses
lower <- c(0,0) #lower and upper limits
upper <- c(1,Inf)
for (i in 1:nAgents) {
  this_agent <- simDF %>% filter(agent == i)
  this_MLE <- optim(par=init
                    , fn = Qlikelihood
                    , lower = lower
                    , upper=upper
                    , method = 'L-BFGS-B'
                    #, method = 'BFGS'
                    , data = this_agent )
  simDF$estimated_alpha[which(simDF$agent == i)] <- this_MLE$par[1] 
  simDF$estimated_beta[which(simDF$agent == i)] <- this_MLE$par[2] 
}

parameterRecoveryDF <- simDF %>% filter(trial==1 & round==1)  #we only want a single row per participant, since we're only interested in the simulated and fit parameters

#Compute correlations
cor_alpha <- cor.test(parameterRecoveryDF$true_alpha, parameterRecoveryDF$estimated_alpha)
cor_beta <- cor.test(parameterRecoveryDF$true_beta, parameterRecoveryDF$estimated_beta)
cors <- data.frame(true_alpha = 0.3 #These values are used here for plotting purposes, to specify the location of the text labels with geom_text
                   , estimated_alpha = 0.9
                   , cor_alpha = cor_alpha$estimate %>% round(3)
                   , true_beta = 1
                   , estimated_beta = 5
                   , cor_beta = cor_beta$estimate %>% round(3))

#Plot results
alpha_recovery <- parameterRecoveryDF %>% 
  ggplot(aes(true_alpha, estimated_alpha)) +
  geom_point() +
  geom_smooth(method='lm', formula = y ~ x) +
  geom_text(data=cors, aes(label = paste("r =",cor_alpha)), hjust = 0, vjust = 0) + 
  theme_classic() 

beta_recovery <- parameterRecoveryDF %>% 
  ggplot(aes(true_beta, estimated_beta)) +
  geom_point() +
  geom_smooth(method='lm', formula = y ~ x) +
  geom_text(data=cors, mapping = aes(label = paste("r =",cor_beta)), hjust = 0, vjust = 0) + 
  theme_classic() 


plot_grid(alpha_recovery, beta_recovery)

Both parameters recover quite well, although we see at least one substantial outlier in our estimated \(\beta\). If these results are not satisfying, we may need a different optimization algorithm, to change the parameters of the task (e.g., add more trials to estimate the model on longer learning trajectories), or swap our model estimation procedure for something more robust (e.g., cross-validation or MCMC sampling).

Exercise 3-7. How well can we recover the parameters of our social learning models? Choose one of the model fitting procedures we described above and run your own parameter recovery.

Simulated behavior

Lastly, we want to ensure that not only do the model fits and parameters recover, but that we are able to capture the relevant aspects of behavior through our model. To test this, we can examine model performance, choices, or any other behavioral variables we can simulate with the model. With the parameters estimated from participants, can we generate realistic behavior?

While we don’t have any real participants to compare against, perhaps we can test if we can see any reliable differences in performance with different values of \(\alpha\) and \(\beta\). In the code snippet below, we perform a median split on \(\alpha\) and \(\beta\) to see if we can detect any differences in performance.

#Median split on alpha and beta
medianAlpha <- median(parameterRecoveryDF$true_alpha)
medianBeta <- median(parameterRecoveryDF$true_beta)
simDF$alphaSplit <- ifelse(simDF$true_alpha <medianAlpha, 'Low Alpha', 'High Alpha')
simDF$betaSplit <- ifelse(simDF$true_beta <medianBeta, 'Low Beta', 'High Beta')

simBehavDF <- simDF %>% group_by(agent, trial, alphaSplit,betaSplit ) %>% summarize(reward = mean(reward)) #average rewards over rounds

ggplot(simBehavDF, aes(x = trial, y = reward, color = interaction(alphaSplit, betaSplit)))+
  geom_line(aes(group=agent))+
  theme_classic()+
  xlab('Trial')+
  ylab('Avg. Reward')+
  facet_grid(alphaSplit~betaSplit)+
  theme(legend.position='none', strip.background = element_blank())+
  scale_color_brewer(palette = 'Dark2')

In general, we see much more variability in performance in the Low Beta condition, although it is harder to detect differences between Low Alpha and High Alpha. However, looking at performance alone is a rather coarse analysis, and it will be up to the researcher to find diagnostic behavioral signatures of different model parameters.
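As one possibility (a sketch using the simulated data from above), we could look at win-stay behavior, i.e., how often agents repeat a choice after a positive reward, which may be more sensitive to \(\alpha\) and \(\beta\) than raw performance:

#Sketch: win-stay behavior as a more diagnostic signature
stayDF <- simDF %>% group_by(agent, round) %>%
  mutate(stay = action == lag(action), prevReward = lag(reward)) %>% #did the agent repeat its previous choice?
  filter(!is.na(stay)) %>% ungroup()
stayDF %>% group_by(alphaSplit, betaSplit) %>%
  summarize(pStayAfterGain = mean(stay[prevReward > 0]), pStayAfterLoss = mean(stay[prevReward <= 0]))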

General Recipe

To wrap up this tutorial, we provide a general recipe and discuss challenges specific to social learning.

  1. What are your hypotheses? Formalize them into models

  2. Decide how to estimate your model parameters and perform model comparison

  3. Is your modeling framework robust? If not, rethink your task, the models, and/or your modeling framework.

  4. [Collect Data]

  5. Analyze and interpret results

  6. Test if recoverability still works with participant parameters

Social learning specific challenges

  • Social learning strategies have frequency-dependent fitness. This means that performance in social learning environments doesn’t only depend on one’s own learning strategy, but also on the strategies of others. As we saw from Rogers’ paradox, the effectiveness of imitation decays rapidly as the number of imitators in the population increases. This means that model simulations don’t only depend on the parameters we specify for the agent, but also on the make-up of the group it is interacting with. Accordingly, objective superiority of one strategy over another can only be demonstrated with evolutionary simulations, which can be very computationally costly.

  • So far, the social learning models we have used as examples all treat other individuals the same. These are known as conformity-biased strategies. However, much of social learning is selective, learning preferentially from successful or prestigious individuals. Thus, we need models that account for selectivity biases as well, but without ballooning in complexity.

  • We only very briefly touched on Theory of Mind reasoning, where individuals infer the hidden mental states of other agents. Yet, modeling this type of inference is very difficult, even using sample-based approximations. Even more difficult is the problem of infinite recursion, where an agent may reason ad infinitum about what other individuals think about them, and so on.

  • Capturing the sophistication of social learning may come with the trade-off of needing to simplify individual learning mechanisms.

References

Gigerenzer, Gerd, and Henry Brighton. 2009. “Homo Heuristicus: Why Biased Minds Make Better Inferences.” Topics in Cognitive Science 1 (1): 107–43.
Lee, Jaewook, Woosuk Sung, and Joo-Ho Choi. 2015. “Metamodel for Efficient Estimation of Capacity-Fade Uncertainty in Li-Ion Batteries for Electric Vehicles.” Energies 8 (6): 5538–54.
Lewandowsky, Stephan, and Simon Farrell. 2010. Computational Modeling in Cognition: Principles and Practice. SAGE publications.
McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in r and Stan. Chapman; Hall/CRC.
Myung, Jay I, and Mark A Pitt. 2009. “Optimal Experimental Design for Model Discrimination.” Psychological Review 116 (3): 499.
Stone, Mervyn. 1977. “An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion.” Journal of the Royal Statistical Society: Series B (Methodological) 39 (1): 44–47.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32.
Wilson, Robert C, and Anne GE Collins. 2019. “Ten Simple Rules for the Computational Modeling of Behavioral Data.” eLife 8: e49547.