packages <- c('tidyverse', 'nnet', 'data.table', 'cowplot')
#invisible(lapply(packages, install.packages)) #Install packages if any are missing
invisible(lapply(packages, require, character.only = TRUE)) #Load all packages
options(dplyr.summarise.inform = FALSE) #suppress annoying messages
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
set.seed(1234) #set seed for reproducibility
This is the R Notebook used in Tutorial 2 of COSMOS Konstanz. In this session, we focus on describing and programming computational models of both individual and social learning mechanisms. We use these models to simulate data, and we also show how they can be fit to data (either from participants or from simulations) to infer parameters. Let’s jump in!
Brisk Introduction to Reinforcement Learning
We start with a very quick introduction to the Reinforcement Learning (RL) framework by Sutton & Barto (2018), which describes how an agent (either biological or machine) can learn how to behave intelligently by interacting with the environment. At each time point \(t\), the agent selects an action \(a \in A\) from a set of possible actions \(A\), and then receives feedback from the environment in the form of a new state \(s_t\) and a reward observation \(r_t\). Since we will focus on a simple bandit task, we can ignore the state information for now and assume that rewards are solely a function of the action \(r_t := R(a_t)\).
Learning a value function
We will start with a Q-learning model (Watkins & Dayan, 1992), as one of the simplest RL models. By trying out actions and receiving rewards, the Q-learning agent learns a value function for each action \(Q(a)\) describing the expected reward. With each reward observation, the current Q-value \(Q_t(a)\) is updated with the following equation:
\[ \begin{equation} Q_{t+1}(a) \leftarrow Q_t(a) + \alpha\color{red}{[r_t(a) - Q_t(a) ]} \end{equation} \] The term colored red is known as the reward prediction error (RPE). It captures the difference between the expected reward \(Q_t(a)\) and the actual reward observation. When the reward is lower than expected, we revise our expectation downwards, and vice versa.
The amount we update our predictions is governed by the learning rate parameter \(\alpha\), which is bounded to the range \([0,1]\). This is often implemented as a free parameter, which is then estimated from data to describe a characteristic of the learner(s). We can estimate \(\alpha\) separately for different experimental manipulations to see how learning changes, for example, in contexts of punishment vs. reward (Palminteri et al., 2015) or how group size influences the rate of individual learning in social contexts (Toyokawa et al., 2019).
The Q-learning model also requires some prior initialization of Q-values, which can often be set to \(Q_{t=0}=0\). However, there are also settings where optimistic initialization (Sutton & Barto, 2018) allows agents to learn and explore more efficiently. Signatures of optimistic exploration have been found in human behavior (Wilson et al., 2014; Wu et al., 2018) and have been included in models of social learning (Ho et al., 2019).
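As a minimal sketch of how this update could look in code (the helper updateQ and its arguments are hypothetical names introduced for illustration; the simulation code further below inlines the same update):
#Minimal sketch of the Q-learning update; updateQ is a hypothetical helper, not part of the tutorial code
updateQ <- function(Qvec, action, reward, alpha){
RPE <- reward - Qvec[action] #reward prediction error
Qvec[action] <- Qvec[action] + alpha*RPE #scale the update by the learning rate
return(Qvec)
}
#updateQ(Qvec = c(0,0), action = 1, reward = 1, alpha = .9) #returns c(0.9, 0.0)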
Exercise 2.1. Can you modify the Q-learning model to be Bayesian? This will allow the model to learn not only the expected reward, but also quantify the level of uncertainty about the outcome of each option. You might be highly confident about the quality of your favorite coffee shop, but a single bad experience at a new restaurant might not be very strong evidence you should never return. One approach is to use a Kalman Filter (Dayan et al., 2000; Speekenbrink & Konstantinidis, 2015; Wu, Schulz, et al., 2022) to learn a distribution over rewards, with the simplifying assumption that each option corresponds to an independent Gaussian distribution and that learning follows linear Gaussian dynamics. Gershman (2015) shows how a Kalman filter provides a Bayesian analogue to the RPE learning updates of the Q-learning model we describe above.
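As one possible starting point (not part of the tutorial code), a Kalman filter update for the chosen option might look as follows; the function kalmanUpdate and the parameters sigmaEps (observation noise) and sigmaXi (innovation noise) are placeholder assumptions:
#Sketch of a Kalman filter update for Exercise 2.1; kalmanUpdate, sigmaEps, and sigmaXi are hypothetical names
kalmanUpdate <- function(m, v, action, reward, sigmaEps = 1, sigmaXi = 0){ #m and v are vectors of posterior means and variances
kGain <- (v[action] + sigmaXi^2) / (v[action] + sigmaXi^2 + sigmaEps^2) #Kalman gain acts like an uncertainty-dependent learning rate
m[action] <- m[action] + kGain*(reward - m[action]) #update the posterior mean using the prediction error
v[action] <- (1 - kGain)*(v[action] + sigmaXi^2) #posterior variance shrinks with each observation
return(list(m = m, v = v))
}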
Converting values to actions
In order to convert values to actions, we need a policy \(\pi\) defining a probability distribution over actions. Each action \(a\) is assigned probability \(\pi(a)\), which can be used to either predict or simulate future actions of the agent. One common method is to use a softmax policy, which selects actions proportional to their exponentiated Q-values:
\[ \begin{equation} \pi(a) \propto \exp\left(\beta Q(a)\right) \end{equation} \] \(\beta\) is the inverse temperature parameter, governing the amount of stochasticity in the policy. When \(\beta\) is small, the policy is more random, with higher probability assigned to lesser-valued options. As \(\beta\) grows larger, the policy tends to favor the highest-valued option more strongly. The code below visualizes the softmax policy across a range of \(\beta\) values and different combinations of Q-values for a simple 2-armed bandit.
softmax <- function(beta, Qvec){
p <- exp(beta*Qvec)
p <- p/sum(p) #normalize to sum to 1
return(p)
}
Qdiff <- seq(-5,5,length.out=100) #Let's fix the value of option 2 to 0 and then modulate the value of option 1 between -5 and 5
#Generate softmax policies for different beta values
softmaxDF <- data.frame()
for (beta in c(0.5, seq(1,4,by=1))){
sims <- sapply(Qdiff, FUN=function(diff) softmax(beta, c(diff,0))) #Compute softmax across the range of value differences
simDF <- data.frame(Qdiff = Qdiff, prob1 = sims[1,], prob2=sims[2,], beta = beta) #Put the value difference, choice probabilities, and beta parameter into a data frame
softmaxDF <- rbind(simDF, softmaxDF)
}
softmaxDF$beta <- factor(softmaxDF$beta)
ggplot(softmaxDF, aes(x = Qdiff, y = prob1, color = beta))+
geom_line()+
theme_classic()+
scale_color_viridis_d()+
labs(title = "Softmax policy",
x = expression(Q[1]-Q[2]),
y = 'Probability of action 1',
col = expression(beta))+
theme(legend.position = c(1,.1), legend.justification = c(1,0))
Exercise 2.2. What other policies could you consider? Think about which strategies you tried in Tutorial 1 during the live demo of the individual learning experiment. Did you find that you sometimes chose an option at random? You could describe that with an \(\epsilon\)-greedy policy (Sutton & Barto, 2018). Did you tend to explore the options you had visited the least? Then you could implement a count-based exploration policy (Machado et al., 2020) or an uncertainty-directed exploration policy such as upper confidence bound sampling (Auer et al., 2002). Note that to implement the latter, you will need a Bayesian RL model (Exercise 2.1) in order to quantify the uncertainty of each option.
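As one illustration (not part of the tutorial code), an \(\epsilon\)-greedy policy could be sketched as follows; the helper name epsilonGreedy is hypothetical:
#Sketch of an epsilon-greedy policy for Exercise 2.2; epsilonGreedy is a hypothetical helper
epsilonGreedy <- function(epsilon, Qvec){
k <- length(Qvec)
p <- rep(epsilon/k, k) #with probability epsilon, choose uniformly at random
best <- which(Qvec == max(Qvec)) #greedy option(s); ties share the remaining probability mass
p[best] <- p[best] + (1 - epsilon)/length(best)
return(p)
}
#epsilonGreedy(epsilon = .1, Qvec = c(0, 1, 0)) #places most probability mass on option 2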
Simulating data
We’re now ready to simulate some data with this Q-learning model paired with the bandit environment we defined in Tutorial 1. We first need to specify some task parameters, such as the number of arms, the means and standard deviations of rewards, the number of rounds, and the number of trials per round. Then we define the model parameters, starting with some sensible values of \(\alpha=.9\), \(\beta=1\), and \(Q_0=0\). The code block below describes the entire simulation procedure and plots reward curves for each individual agent and round (colored lines) together with the average across all simulations (black line).
#Task parameters for bandit with Gaussian rewards
k <- 10 #number of arms
meanVec <- seq(-10,10, length.out=k) #Payoff means
sigmaVec <- rep(1, k) #Payoff stdevs
T <- 25 #total number of trials
nAgents <- 4
rounds <- 10
banditGenerator <- function(a) {#a is an integer or a vector of integers, selecting one of the 1:k arms of the bandit
payoff <- rnorm(n = length(a), mean = meanVec[a], sd = sigmaVec[a])
return (payoff)
}
#Model parameters
alpha <- .9 #learning rate
beta <- 1 #Softmax inverse temperature
Q0 <- Qvec <- rep(0,k) #prior initialization of Q-values
#Now simulate data for multiple agents over multiple rounds
simDF <- data.frame()
for (a in 1:nAgents){ #loop through agents
for (r in 1:rounds){ #loop through rounds
Qvec <- Q0 #reset Q-values
for (t in 1:T){ #loop through trials
p <- softmax(beta, Qvec) #compute softmax policy
action <- sample(1:k,size = 1, prob=p) #sample action
reward <- banditGenerator(action) #generate reward
Qvec[action] <- Qvec[action] + alpha*(reward - Qvec[action]) #update q-values
chosen <- rep(0, k) #create an index for the chosen option
chosen[action]<- 1 #1 = chosen, 0 = not
trialDF <- data.frame(trial = t, agent = a, round = r, Q = Qvec, action = 1:k, chosen = chosen, reward = reward)
simDF <- rbind(simDF,trialDF)
}
}
}
saveRDS(simDF, 'data/simChoicesQlearning.Rds')
#Plot results
ggplot(subset(simDF, chosen==1), aes(x = trial, y = reward, color = interaction(agent,round)))+
geom_line(size = .5, alpha = 0.5)+
stat_summary(fun = mean, geom='line', color = 'black', size = 1)+
theme_classic()+
xlab('Trial') +
ylab('Reward')+
ggtitle('Simulated performance')+
theme(legend.position='none')
Demo 1: Tweaking individual learning parameters
Simulated behavior is sensitive to the choice of model parameters. At the following link, you can access an interactive Shiny app to explore how different model parameters influence the simulated learning curves.
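If you'd rather experiment locally, one option is to wrap the simulation loop from above in a small helper that reuses the task parameters, softmax, and banditGenerator defined earlier; the function name simulateQ is hypothetical and introduced here only for illustration:
#Sketch of a helper to rerun the simulation with different parameters (simulateQ is not part of the tutorial code)
simulateQ <- function(alpha, beta, rounds = 10, trials = 25){
rewards <- matrix(NA, nrow = rounds, ncol = trials) #store rewards per round and trial
for (r in 1:rounds){
Qvec <- rep(0, k) #reset Q-values each round; k and banditGenerator come from the chunks above
for (t in 1:trials){
action <- sample(1:k, size = 1, prob = softmax(beta, Qvec))
reward <- banditGenerator(action)
Qvec[action] <- Qvec[action] + alpha*(reward - Qvec[action])
rewards[r, t] <- reward
}
}
colMeans(rewards) #average learning curve across rounds
}
#plot(simulateQ(alpha = .9, beta = 1), type = 'l') #compare against, e.g., simulateQ(alpha = .1, beta = 1)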
Likelihood function
Beyond simulating data, we also want to use computational models to describe experimental data collected from participants. Comparing how well different models describe the data allows us to test competing hypotheses, while the estimated parameters of the winning model provide an interpretable characterization of behavior.
To fit a model to data, we first need to define a likelihood function:
\[ \begin{equation} P(D|\theta) \end{equation} \] The likelihood function describes the likelihood of the observed data \(D\) (i.e., actions made by a subject) given a set of parameters \(\theta\) (e.g., \(\alpha\) and \(\beta\)).
Since we are hardly ever modeling a single behavioral data point, but rather a sequence of observations (e.g., over multiple trials of learning and multiple rounds of a task), we need to describe the joint likelihood over all observations under consideration \(P(D|\theta) = \prod_t P(d_t|\theta)\), which is the product of the likelihood of each observation \(P(d_t|\theta)\). This becomes much easier using logarithms, since we can replace multiplication with summation in log-space and simply sum over the log-likelihoods:
\[ \begin{equation} \log P(D|\theta) = \sum_t \log P(d_t|\theta) \end{equation} \] Note that natural logs \(\ln\) are most commonly used rather than base 10 logarithms \(\log_{10}\).
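A quick numerical illustration of why we work in log-space (the probabilities here are made up for demonstration):
#Multiplying many small probabilities underflows to zero, but summing their logs does not
p <- rep(0.1, 400) #400 hypothetical choice probabilities of 0.1
prod(p) #underflows to 0 in double precision
sum(log(p)) #-921.03: the joint log-likelihood, computed without underflow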
The (natural) log of any probability will be negative (since probabilities are always less than or equal to 1). Thus, it’s more convenient to express the fit of a model in terms of the negative log-likelihood (nLL), by simply inverting the sign of the previous equation:
\[ \begin{equation} nLL= -\sum_t \log P(d_t|\theta) \end{equation} \] The nLL will always be a value greater than zero, with smaller values indicating a better fit. Thus, nLL expresses the amount of error or loss in quantifying how well a model (given a set of parameters) fits the data, and is sometimes also called “log loss”.
In practice, we can define a likelihood function (using very similar code to our model simulations) that takes a set of parameters and participant data, and then returns the nLL. The likelihood function iteratively loops through the trials of the task and predicts the next action based on the current policy \(\pi_t\). We then take the log probability of the actual action that was selected, and use that to update our running tally of the nLL. Remember that we also need to use the actual action and reward observations to update the value function of the agent in order for the model predictions to evolve following the learning dynamics of our model.
Lastly, we wrap the likelihood function in an optimization routine to find the set of parameters that minimizes the nLL. By minimizing the loss of the model, we arrive at an estimate of the most likely set of parameters, known as the Maximum Likelihood Estimate (MLE):
likelihood <- function(params, data, Q0=0){ #We assume that prior value estimates Q0 are fixed to 0 and not estimated as part of the model
names(params) <- c('alpha', 'beta') #name parameter vec
nLL <- 0 #Initialize negative log likelihood
rounds <- max(data$round)
trials <- max(data$trial)
for (r in 1:rounds){ #loop through rounds
Qvec <- rep(Q0,k) #reset Q-values each new round
for (t in 1:trials){ #loop through trials
p <- softmax(params['beta'], Qvec) #compute softmax policy
trialData <- subset(data, chosen==1 & trial==t & round == r) #observed choice and reward on this trial
trueAction <- trialData$action
negativeloglikelihood <- -log(p[trueAction]) #compute negative log likelihood
nLL <- nLL + negativeloglikelihood #update running count
Qvec[trueAction] <- Qvec[trueAction] + params['alpha']*(trialData$reward - Qvec[trueAction]) #update q-values
}
}
return(nLL)
}
#Now let's optimize the parameters
init <- c(1,1) #initial guesses
lower <- c(0,-Inf) #lower and upper limits. We use very liberal bounds here, but you may want to set stricter bounds for better results
upper <- c(1,Inf)
MLE <- optim(par=init, fn = likelihood,lower = lower, upper=upper, method = 'L-BFGS-B', data = subset(simDF, chosen==1 & agent==1) )
Using the data we simulated earlier with \(\alpha\) = 0.9 and \(\beta\) = 1.0, we arrive at MLE estimates of \(\hat{\alpha}\) = 0.95 and \(\hat{\beta}\) = 1.10. The corresponding negative log-likelihood of 109.57 quantifies how well this parameterization fits the data.
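These values are read straight off the optim output: MLE$par holds the parameter estimates and MLE$value the negative log-likelihood at the optimum.
MLE$par #MLE estimates of alpha and beta
MLE$value #negative log-likelihood at the MLE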
Loss landscape
We can also compute nLLs across a range of parameter combinations in order to visualize the loss landscape. The plot below illustrates how some regions of the parameter space produce similar fits to the data.
#Create a grid of plausible parameter combinations
paramCombs <- expand.grid(alpha = seq(0,1, length.out=20), beta = seq(0,20, length.out=20))
#Now compute nLLs for each grid combination
nLLs <- sapply(1:nrow(paramCombs), FUN=function(i) likelihood(as.numeric(paramCombs[i,]), subset(simDF, chosen==1 & agent==1))) #This can be a bit slow
paramCombs$nLL <- nLLs #add to dataframe
bestFit <- paramCombs[paramCombs$nLL == min(nLLs),] #best fitting combination
#plot data
ggplot(paramCombs)+
geom_tile(aes(x = alpha, y = beta, fill = nLL)) +
geom_contour(aes(x = alpha, y = beta, z = nLL), color = 'white')+
geom_point(data=data.frame(alpha=alpha, beta=beta, type = 'true'), aes(alpha,beta, shape = type), size = 4, color = 'red')+ # true value
geom_point(data=data.frame(alpha=bestFit$alpha, beta=bestFit$beta, type = 'MLE'), mapping=aes(alpha,beta, shape = type), size = 4, color = 'red')+ # best fitting value
scale_fill_viridis_c('nLL', direction = -1)+
labs(title = "Loss landscape",
x = expression(alpha),
y = expression(beta))+
scale_shape_manual(values= c(4, 0), name = '')+
theme_classic()+
scale_x_continuous(expand = c(0, 0))+
scale_y_continuous(expand = c(0, 0))