packages <- c('tidyverse', 'nnet', 'data.table', 'cowplot')
#invisible(lapply(packages, install.packages)) #Install packages if any are missing
invisible(lapply(packages, require, character.only = TRUE)) #Load all packages
options(dplyr.summarise.inform = FALSE) #suppress annoying messages
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
set.seed(1234) #set seed for reproducibility
This is the R Notebook used in Tutorial 2 of COSMOS Konstanz. In this session, we focus on describing and programming computational models of both individual and social learning mechanisms. We use these models to simulate data, and we also show how they can be fit to data (either from participants or from simulations) to infer parameters. Let’s jump in!
Brisk Introduction to Reinforcement Learning
We start with a very quick introduction to the Reinforcement Learning (RL) framework by Sutton & Barto (2018), which describes how an agent (either biological or machine) can learn how to behave intelligently by interacting with the environment. At each time point \(t\), the agent selects an action \(a \in A\) from a set of possible actions \(A\), and then receives feedback from the environment in the form of a new state \(s_t\) and a reward observation \(r_t\). Since we will focus on a simple bandit task, we can ignore the state information for now and assume that rewards are solely a function of the action \(r_t := R(a_t)\).
Learning a value function
We will start with a Q-learning model (Watkins & Dayan, 1992), as one of the simplest RL models. By trying out actions and receiving rewards, the Q-learning agent learns a value function for each action \(Q(a)\) describing the expected reward. With each reward observation, the current Q-value \(Q_t(a)\) is updated with the following equation:
\[ \begin{equation} Q_{t+1}(a) \leftarrow Q_t(a) + \alpha\color{red}{[r_t(a) - Q_t(a) ]} \end{equation} \] The term colored red is known as the reward prediction error (RPE). It captures the difference between the expected reward \(Q_t(a)\) and the actual reward observation. When the reward is lower than expected, we revise our expectation downwards, and vice versa.
The amount we update our predictions is governed by the learning rate parameter \(\alpha\), which is bounded to the range \([0,1]\). This is often implemented as a free parameter, which is then estimated from data to describe a characteristic of the learner(s). We can estimate \(\alpha\) separately for different experimental manipulations to see how learning changes, for example, in contexts of punishment vs. reward (Palminteri et al., 2015) or how group size influences the rate of individual learning in social contexts (Toyokawa et al., 2019).
The Q-learning model also requires some prior initialization of Q-values, which can often be set to \(Q_{t=0}=0\). However, there are also settings where optimistic initialization (Sutton & Barto, 2018) allows agents to learn and explore more efficiently. Signatures of optimistic exploration have been found in human behavior (Wilson et al., 2014; Wu et al., 2018) and have been included in models of social learning (Ho et al., 2019).
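As a minimal sketch of how this update could look in code (the helper updateQ and its arguments are hypothetical names introduced for illustration; the simulation code further below inlines the same update):
#Minimal sketch of the Q-learning update; updateQ is a hypothetical helper, not part of the tutorial code
updateQ <- function(Qvec, action, reward, alpha){
RPE <- reward - Qvec[action] #reward prediction error
Qvec[action] <- Qvec[action] + alpha*RPE #scale the update by the learning rate
return(Qvec)
}
#updateQ(Qvec = c(0,0), action = 1, reward = 1, alpha = .9) #returns c(0.9, 0.0)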
Exercise 2.1. Can you modify the Q-learning model to be Bayesian? This will allow the model to learn not only the expected reward, but also quantify the level of uncertainty about the outcome of each option. You might be highly confident about the quality of your favorite coffee shop, but a single bad experience at a new restaurant might not be very strong evidence you should never return. One approach is to use a Kalman Filter (Dayan et al., 2000; Speekenbrink & Konstantinidis, 2015; Wu, Schulz, et al., 2022) to learn a distribution over rewards, with the simplifying assumption that each option corresponds to an independent Gaussian distribution and that learning follows linear Gaussian dynamics. Gershman (2015) shows how a Kalman filter provides a Bayesian analogue to the RPE learning updates of the Q-learning model we describe above.
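As one possible starting point (not part of the tutorial code), a Kalman filter update for the chosen option might look as follows; the function kalmanUpdate and the parameters sigmaEps (observation noise) and sigmaXi (innovation noise) are placeholder assumptions:
#Sketch of a Kalman filter update for Exercise 2.1; kalmanUpdate, sigmaEps, and sigmaXi are hypothetical names
kalmanUpdate <- function(m, v, action, reward, sigmaEps = 1, sigmaXi = 0){ #m and v are vectors of posterior means and variances
kGain <- (v[action] + sigmaXi^2) / (v[action] + sigmaXi^2 + sigmaEps^2) #Kalman gain acts like an uncertainty-dependent learning rate
m[action] <- m[action] + kGain*(reward - m[action]) #update the posterior mean using the prediction error
v[action] <- (1 - kGain)*(v[action] + sigmaXi^2) #posterior variance shrinks with each observation
return(list(m = m, v = v))
}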
Converting values to actions
In order to convert values to actions, we need a policy \(\pi\) defining a probability distribution over actions. Each action \(a\) is assigned probability \(\pi(a)\), which can be used to either predict or simulate future actions of the agent. One common method is to use a softmax policy, which selects actions proportional to their exponentiated Q-values:
\[ \begin{equation} \pi(a) \propto \exp\left(\beta Q(a)\right) \end{equation} \] \(\beta\) is the inverse temperature parameter, governing the amount of stochasticity in the policy. When \(\beta\) is small, the policy is more random, with higher probability assigned to lesser-valued options. As \(\beta\) grows larger, the policy tends to favor the highest-valued option more strongly. The code below visualizes the softmax policy across a range of \(\beta\) values and different combinations of Q-values for a simple 2-armed bandit.
softmax <- function(beta, Qvec){
p <- exp(beta*Qvec)
p <- p/sum(p) #normalize to sum to 1
return(p)
}
Qdiff <- seq(-5,5,length.out=100) #Let's fix the value of option 2 to 0 and then modulate the value of option 1 between -5 and 5
#Generate softmax policies for different beta values
softmaxDF <- data.frame()
for (beta in c(0.5, seq(1,4,by=1))){
sims <- sapply(Qdiff, FUN=function(diff) softmax(beta, c(diff,0))) #Compute softmax across the range of value differences
simDF <- data.frame(Qdiff = Qdiff, prob1 = sims[1,], prob2=sims[2,], beta = beta) #Put the value difference, choice probabilities, and beta parameter into a data frame
softmaxDF <- rbind(simDF, softmaxDF)
}
softmaxDF$beta <- factor(softmaxDF$beta)
ggplot(softmaxDF, aes(x = Qdiff, y = prob1, color = beta))+
geom_line()+
theme_classic()+
scale_color_viridis_d()+
labs(title = "Softmax policy",
x = expression(Q[1]-Q[2]),
y = 'Probability of action 1',
col = expression(beta))+
theme(legend.position = c(1,.1), legend.justification = c(1,0))
Exercise 2.2. What other policies could you consider? Think about which strategies you tried in Tutorial 1 during the live demo of the individual learning experiment. Did you find that you sometimes chose an option at random? You could describe that with an \(\epsilon\)-greedy policy (Sutton & Barto, 2018). Did you tend to explore the options you had visited the least? Then you could implement a count-based exploration policy (Machado et al., 2020) or an uncertainty-directed exploration policy such as upper confidence bound sampling (Auer et al., 2002). Note that to implement the latter, you will need a Bayesian RL model (Exercise 2.1) in order to quantify the uncertainty of each option.
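As one illustration (not part of the tutorial code), an \(\epsilon\)-greedy policy could be sketched as follows; the helper name epsilonGreedy is hypothetical:
#Sketch of an epsilon-greedy policy for Exercise 2.2; epsilonGreedy is a hypothetical helper
epsilonGreedy <- function(epsilon, Qvec){
k <- length(Qvec)
p <- rep(epsilon/k, k) #with probability epsilon, choose uniformly at random
best <- which(Qvec == max(Qvec)) #greedy option(s); ties share the remaining probability mass
p[best] <- p[best] + (1 - epsilon)/length(best)
return(p)
}
#epsilonGreedy(epsilon = .1, Qvec = c(0, 1, 0)) #places most probability mass on option 2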
Simulating data
We’re now ready to simulate some data with this Q-learning model paired with the bandit environment we defined in Tutorial 1. We first need to specify some task parameters, such as the number of arms, the means and standard deviations of rewards, the number of rounds, and the number of trials per round. Then we define the model parameters, starting with some sensible values of \(\alpha=.9\), \(\beta=1\), and \(Q_0=0\). The code block below describes the entire simulation procedure and plots reward curves for each individual agent and round (colored lines) together with the average across all simulations (black line).
#Task parameters for bandit with Gaussian rewards
k <- 10 #number of arms
meanVec <- seq(-10,10, length.out=k) #Payoff means
sigmaVec <- rep(1, k) #Payoff stdevs
T <- 25 #total number of trials
nAgents <- 4
rounds <- 10
banditGenerator <- function(a) {#a is an integer or a vector of integers, selecting one of the 1:k arms of the bandit
payoff <- rnorm(n = length(a), mean = meanVec[a], sd = sigmaVec[a])
return (payoff)
}
#Model parameters
alpha <- .9 #learning rate
beta <- 1 #Softmax inverse temperature
Q0 <- Qvec <- rep(0,k) #prior initialization of Q-values
#Now simulate data for multiple agents over multiple rounds
simDF <- data.frame()
for (a in 1:nAgents){ #loop through agents
for (r in 1:rounds){ #loop through rounds
Qvec <- Q0 #reset Q-values
for (t in 1:T){ #loop through trials
p <- softmax(beta, Qvec) #compute softmax policy
action <- sample(1:k,size = 1, prob=p) #sample action
reward <- banditGenerator(action) #generate reward
Qvec[action] <- Qvec[action] + alpha*(reward - Qvec[action]) #update q-values
chosen <- rep(0, k) #create an index for the chosen option
chosen[action]<- 1 #1 = chosen, 0 = not
trialDF <- data.frame(trial = t, agent = a, round = r, Q = Qvec, action = 1:k, chosen = chosen, reward = reward)
simDF <- rbind(simDF,trialDF)
}
}
}
saveRDS(simDF, 'data/simChoicesQlearning.Rds')
#Plot results
ggplot(subset(simDF, chosen==1), aes(x = trial, y = reward, color = interaction(agent,round)))+
geom_line(size = .5, alpha = 0.5)+
stat_summary(fun = mean, geom='line', color = 'black', size = 1)+
theme_classic()+
xlab('Trial') +
ylab('Reward')+
ggtitle('Simulated performance')+
theme(legend.position='none')
Demo 1: Tweaking individual learning parameters
Simulated behavior is sensitive to the choice of model parameters. At the following link, you can access an interactive Shiny app to explore how different model parameters influence the simulated learning curves.
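If you'd rather experiment locally, one option is to wrap the simulation loop from above in a small helper that reuses the task parameters, softmax, and banditGenerator defined earlier; the function name simulateQ is hypothetical and introduced here only for illustration:
#Sketch of a helper to rerun the simulation with different parameters (simulateQ is not part of the tutorial code)
simulateQ <- function(alpha, beta, rounds = 10, trials = 25){
rewards <- matrix(NA, nrow = rounds, ncol = trials) #store rewards per round and trial
for (r in 1:rounds){
Qvec <- rep(0, k) #reset Q-values each round; k and banditGenerator come from the chunks above
for (t in 1:trials){
action <- sample(1:k, size = 1, prob = softmax(beta, Qvec))
reward <- banditGenerator(action)
Qvec[action] <- Qvec[action] + alpha*(reward - Qvec[action])
rewards[r, t] <- reward
}
}
colMeans(rewards) #average learning curve across rounds
}
#plot(simulateQ(alpha = .9, beta = 1), type = 'l') #compare against, e.g., simulateQ(alpha = .1, beta = 1)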
Likelihood function
Beyond simulating data, we also want to use computational models to describe experimental data collected from participants. Comparing how well different models describe the data allows us to test competing hypotheses, while the estimated parameters of the winning model provide an interpretable characterization of behavior.
To fit a model to data, we first need to define a likelihood function:
\[ \begin{equation} P(D|\theta) \end{equation} \] The likelihood function describes the likelihood of the observed data \(D\) (i.e., actions made by a subject) given a set of parameters \(\theta\) (e.g., \(\alpha\) and \(\beta\)).
Since we are hardly ever modeling a single behavioral data point, but rather a sequence of observations (e.g., over multiple trials of learning and multiple rounds of a task), we need to describe the joint likelihood over all observations under consideration \(P(D|\theta) = \prod_t P(d_t|\theta)\), which is the product of the likelihood of each observation \(P(d_t|\theta)\). This becomes much easier using logarithms, since we can replace multiplication with summation in log-space and simply sum over the log-likelihoods:
\[ \begin{equation} \log P(D|\theta) = \sum_t \log P(d_t|\theta) \end{equation} \] Note that natural logs \(\ln\) are most commonly used rather than base 10 logarithms \(\log_{10}\).
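A quick numerical illustration of why we work in log-space (the probabilities here are made up for demonstration):
#Multiplying many small probabilities underflows to zero, but summing their logs does not
p <- rep(0.1, 400) #400 hypothetical choice probabilities of 0.1
prod(p) #underflows to 0 in double precision
sum(log(p)) #-921.03: the joint log-likelihood, computed without underflow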
The (natural) log of any probability will be negative (since probabilities are always less than or equal to 1). Thus, it’s more convenient to express the fit of a model in terms of the negative log-likelihood (nLL), by simply inverting the sign of the previous equation:
\[ \begin{equation} nLL= -\sum_t \log P(d_t|\theta) \end{equation} \] The nLL will always be a value greater than zero, with smaller values indicating a better fit. Thus, nLL expresses the amount of error or loss in quantifying how well a model (given a set of parameters) fits the data, and is sometimes also called “log loss”.
In practice, we can define a likelihood function (using very similar code to our model simulations) that takes a set of parameters and participant data, and then returns the nLL. The likelihood function iteratively loops through the trials of the task and predicts the next action based on the current policy \(\pi_t\). We then take the log probability of the actual action that was selected, and use that to update our running tally of the nLL. Remember that we also need to use the actual action and reward observations to update the value function of the agent in order for the model predictions to evolve following the learning dynamics of our model.
Lastly, we wrap the likelihood function in an optimization routine to find the set of parameters that minimizes the nLL. By minimizing the loss of the model, we arrive at an estimate of the most likely set of parameters, known as the Maximum Likelihood Estimate (MLE):
likelihood <- function(params, data, Q0=0){ #We assume that prior value estimates Q0 are fixed to 0 and not estimated as part of the model
names(params) <- c('alpha', 'beta') #name parameter vec
nLL <- 0 #Initialize negative log likelihood
rounds <- max(data$round)
trials <- max(data$trial)
for (r in 1:rounds){ #loop through rounds
Qvec <- rep(Q0,k) #reset Q-values each new round
for (t in 1:trials){ #loop through trials
p <- softmax(params['beta'], Qvec) #compute softmax policy
trialData <- subset(data, chosen==1 & trial==t & round == r) #observed choice and reward on this trial
trueAction <- trialData$action
negativeloglikelihood <- -log(p[trueAction]) #compute negative log likelihood
nLL <- nLL + negativeloglikelihood #update running count
Qvec[trueAction] <- Qvec[trueAction] + params['alpha']*(trialData$reward - Qvec[trueAction]) #update q-values
}
}
return(nLL)
}
#Now let's optimize the parameters
init <- c(1,1) #initial guesses
lower <- c(0,-Inf) #lower and upper limits. We use very liberal bounds here, but you may want to set stricter bounds for better results
upper <- c(1,Inf)
MLE <- optim(par=init, fn = likelihood,lower = lower, upper=upper, method = 'L-BFGS-B', data = subset(simDF, chosen==1 & agent==1) )
Using the data we simulated earlier with \(\alpha\) = 0.9 and \(\beta\) = 1.0, we arrive at MLE estimates of \(\hat{\alpha}\) = 0.95 and \(\hat{\beta}\) = 1.10. The corresponding negative log-likelihood of 109.57 quantifies how well this parameterization fits the data.
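These values are read straight off the optim output: MLE$par holds the parameter estimates and MLE$value the negative log-likelihood at the optimum.
MLE$par #MLE estimates of alpha and beta
MLE$value #negative log-likelihood at the MLE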
Loss landscape
We can also compute nLLs across a range of parameter combinations in order to visualize the loss landscape. The plot below illustrates how some regions of the parameter space produce similar fits to the data.
#Create a grid of plausible parameter combinations
paramCombs <- expand.grid(alpha = seq(0,1, length.out=20), beta = seq(0,20, length.out=20))
#Now compute nLLs for each grid combination
nLLs <- sapply(1:nrow(paramCombs), FUN=function(i) likelihood(as.numeric(paramCombs[i,]), subset(simDF, chosen==1 & agent==1))) #This can be a bit slow
paramCombs$nLL <- nLLs #add to dataframe
bestFit <- paramCombs[paramCombs$nLL == min(nLLs),] #best fitting combination
#plot data
ggplot(paramCombs)+
geom_tile(aes(x = alpha, y = beta, fill = nLL)) +
geom_contour(aes(x = alpha, y = beta, z = nLL), color = 'white')+
geom_point(data=data.frame(alpha=alpha, beta=beta, type = 'true'), aes(alpha,beta, shape = type), size = 4, color = 'red')+ # true value
geom_point(data=data.frame(alpha=bestFit$alpha, beta=bestFit$beta, type = 'MLE'), mapping=aes(alpha,beta, shape = type), size = 4, color = 'red')+ # best fitting value
scale_fill_viridis_c('nLL', direction = -1)+
labs(title = "Loss landscape",
x = expression(alpha),
y = expression(beta))+
scale_shape_manual(values= c(4, 0), name = '')+
theme_classic()+
scale_x_continuous(expand = c(0, 0))+
scale_y_continuous(expand = c(0, 0))