packages <- c('tidyverse', 'vembedr', 'formatR')
#invisible(lapply(packages, install.packages, character.only = TRUE)) #Install packages if any are missing
invisible(lapply(packages, require, character.only = TRUE)) #Load all packages
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
set.seed(1234) #set seed for reproducibility
Welcome to tutorial 1 of COSMOS Konstanz. These code notebooks are designed as self-contained learning tools for building on the concepts taught during the summer school, but they should also be accessible to anyone around the world interested in these topics.
The code notebooks introduce theoretical concepts, provide basic mathematical formulas, and offer helpful code snippets that you can use in your own projects. There are also links to demonstrations of simple online experiments and interactive Shiny apps that let you play around with different experimental paradigms, computational models, and simulation frameworks. The notebooks also contain exercises, which offer suggestions for how to go beyond the materials.
Multi-armed bandit problem
Our first goal is to design a simple learning environment for studying social learning and collective behavior. We start with a multi-armed bandit problem (Robbins, 1952; Steyvers et al., 2009; Wald, 1947) as one of the simplest tasks commonly used to study learning in humans, animals, and artificial agents.
The term “multi-armed bandit” is simply a colorful metaphor for a row of slot machines in a casino. Since slot machines are notorious for stealing your money, a slot machine is sometimes known as a “one-armed bandit”, in reference to the lever used to play it.
Each of the \(k \in \{1,\ldots, K\}\) slot machines refers to one of \(K\) different available decisions or options, each yielding a reward from an initially unknown payoff function. We can easily map this metaphor onto a wide range of real-world learning problems, such as choosing what to order at a restaurant, where to forage for food, or which crops to plant. The goal is thus to maximize your payoffs by sequentially playing one of the options at each time step \(t\).
In order to do so, one must balance exploring new options to acquire information about payoffs (to inform future decisions) against exploiting options known to have high payoffs (to acquire immediate rewards). This is known as the exploration-exploitation dilemma (Mehlhorn et al., 2015), and every living organism capable of learning must strike an efficient balance between these dual goals. Yet, given a finite time horizon for learning, optimal solutions are generally unavailable (Aplin et al., 2015; Deffner et al., 2020; Sasaki et al., 2013).
2-armed Gaussian bandits
Let’s start from the simplest case and program a 2-armed bandit with Gaussian rewards. Imagine you are a farmer who can choose to plant one of two crops: wheat or potatoes. At the end of the growing season, you earn a reward proportional to how well your crop performed. However, one of the crops may produce reliably better yields, and it is up to you to figure out which crop is better.
Year after year, there may be variations in yield for a given crop, which we can model using Gaussian noise. Thus, we can describe the reward for option \(k\) as a sample from a Gaussian distribution (also known as a Normal distribution, hence the fancy \(\mathcal{N}\)):
\[ r_k \sim \mathcal{N}(\mu_k, \sigma_k^2) \]
We can now describe a payoff function for each option in terms of a mean or expected reward \(\mu_k\) and the variability \(\sigma_k^2\) (expressed as a variance). Note that the variability can also be expressed in terms of a standard deviation as \(\sigma_k = \sqrt{\sigma_k^2}\).
(Note: people using MacOS Ventura may have issues rendering the equations due to changes in the font library. If you see a box instead of a fancy N, then you can fix the issue using right-click > Math Settings > Math Renderer > “Common HTML”).
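To make the payoff function concrete before we build the full generative model, here is a minimal sketch of drawing one season’s yield for each crop. The means and standard deviations below are purely hypothetical, illustrative values (not from the tutorial):
cropMeans <- c(wheat = 5, potato = 2) #hypothetical expected yields, i.e. mu_k for each crop
cropSDs <- c(wheat = 1, potato = 1) #hypothetical payoff variability, i.e. sigma_k for each crop
rnorm(n = 2, mean = cropMeans, sd = cropSDs) #one season's reward r_k ~ N(mu_k, sigma_k^2) for each crop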
Generative model of the environment
Let’s now program a generative model of the bandit task. By specifying the number of options, along with the expected reward and variability of each one, we can write a function that takes an action and returns a reward. We provide the function in the code block below and generate payoffs for 25 random choices.
K <- 2 #number of options
meanVec <- runif(n = K, min = -10, max = 10) #Payoff means, sampled from a uniform distribution between -10 and 10
sigmaVec <- rep(1, K) #Variability of payoffs (as a standard deviation), which we assume is the same for all options
banditGenerator <- function(k) { #k is an integer or a vector of integers, selecting one of the 1:K arms of the bandit
  payoff <- rnorm(n = length(k), mean = meanVec[k], sd = sigmaVec[k])
  return(payoff)
}
actionSequence <- sample(1:K, size = 25, replace = TRUE) # select 25 random actions
payoffs <- banditGenerator(actionSequence)
df <- data.frame(action = actionSequence, payoff = payoffs)
knitr::kable(df)
| action | payoff |
|---:|---:|
| 2 | 3.8159985 |
| 2 | 0.7586613 |
| 1 | -8.3533680 |
| 2 | 2.4643047 |
| 1 | -7.0206883 |
| 1 | -8.3729508 |
| 1 | -6.8577509 |
| 2 | 2.8216237 |
| 2 | 2.7562503 |
| 2 | 2.4509950 |
| 2 | 2.4083578 |
| 1 | -7.0019557 |
| 2 | 1.9492492 |
| 2 | 2.4573833 |
| 2 | 2.4558480 |
| 1 | -7.0476604 |
| 2 | 3.4755511 |
| 2 | 0.7164596 |
| 2 | 0.2416400 |
| 2 | 2.9891610 |
| 2 | 2.4674153 |
| 2 | 2.6085539 |
| 2 | 3.6877424 |
| 2 | 3.2287656 |
| 1 | -7.6778111 |
Let’s now visualize the environment, where we use a large number of samples (n=10000) to paint an accurate picture of the true generative reward structure (vertical density plot), and plot that against a small number (n=25) of sampled payoffs (dots):
source('codeSnippets/geom_flat_violin.R')
# Use samples to approximate the reward distribution
sampleDF <- data.frame()
nsamples = 10000 #Samples used to approximate the normal distribution
plotsamples = 25 #samples to plot as dots
rewardSamples <- c(sapply(1:K, FUN=function(k) banditGenerator(rep(k, nsamples))))#use an apply function to simulate multiple samples of each arm
sampleDF <- data.frame(option = rep(1:K, each = nsamples), payoff=rewardSamples, nsample = rep(1:nsamples, K)) #create a dataframe for plotting
ggplot(sampleDF, aes(x = factor(option), y = payoff, fill = factor(option)))+
geom_flat_violin(position = position_nudge(x = 0.3, y = 0), alpha = .5) +
geom_jitter(data = subset(sampleDF, nsample <= plotsamples), aes(y = payoff, color = factor(option)), width = 0.2)+ #Plot only a subset of points to show how well a limited number of samples approximates the true distribution
ylab('Payoff')+
xlab('Option')+
theme_classic() +
scale_fill_viridis_d()+
scale_color_viridis_d()+
ggtitle('Payoff conditions')+
theme(legend.position ='none')
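As an optional sanity check (an addition, not part of the original tutorial), we can also compare the empirical mean payoff of each option across the 10,000 samples against the true generating means stored in meanVec; the two should be nearly identical:
# Optional sanity check: empirical sample means vs. the true generating means
sampleDF %>%
  group_by(option) %>%
  summarise(empiricalMean = mean(payoff), trueMean = meanVec[first(option)])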
Scaling up
Of course, a 2-armed bandit is quite a simple learning environment. But we can very easily scale up the number of options. How many trials do you think it would take you to identify the best option below?
#define a larger set of options
K <- 10
meanVec <- runif(n = K, min = -10, max = 10) #Payoff means, sampled from a uniform distribution between -10 and 10
sigmaVec <- rep(1, K)
# Use samples to approximate the reward distribution
sampleDF <- data.frame()
nsamples = 10000 #Samples used to approximate the normal distribution
plotsamples = 25 #samples to plot as dots
rewardSamples <- c(sapply(1:K, FUN=function(k) banditGenerator(rep(k, nsamples))))#use an apply function to simulate multiple samples of each arm
sampleDF <- data.frame(option = rep(1:K, each = nsamples), payoff=rewardSamples, nsample = rep(1:nsamples, K)) #create a dataframe for plotting
ggplot(sampleDF, aes(x = factor(option), y = payoff, fill = factor(option)))+
geom_flat_violin(position = position_nudge(x = 0.3, y = 0), alpha = .5) +
geom_jitter(data = subset(sampleDF, nsample <= plotsamples), aes(y = payoff, color = factor(option)), width = 0.2)+ #Plot only a subset of points to show how well a limited number of samples approximates the true distribution
ylab('Payoff')+
xlab('Option')+
theme_classic() +
scale_fill_viridis_d()+
scale_color_viridis_d()+
ggtitle('Payoff conditions')+
theme(legend.position ='none')
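If you want to peek at the ground truth after making your guess (this snippet is an addition, not from the original tutorial), the best option is simply the arm with the highest generating mean:
# Peek at the ground truth: which arm has the highest expected payoff, and how large is it?
which.max(meanVec) #index of the best option
max(meanVec) #its expected reward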
Exercise 1.1. Can you modify the payoff-generating function to describe other tasks? Sometimes rewards are binary, such as success or failure outcomes when publishing a paper or hunting for food. With binary reward structures, you can use a Bernoulli distribution in place of a Gaussian distribution. The code snippet below contains a brief example.
K <- 10
successVec <- runif(K) #different success probabilities for each option, sampled from a uniform distribution between 0 and 1
binaryGenerator <- function(k) { #k is an integer or a vector of integers, selecting one of the 1:K arms of the bandit
  payoff <- rbernoulli(n = length(k), p = successVec[k]) #a Bernoulli distribution (rbernoulli comes from the purrr package, loaded with tidyverse) describes a binary reward structure
  return(payoff)
}
binaryGenerator(sample(1:10, 25, replace = TRUE))
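As a further illustration for this exercise (an addition, not part of the original tutorial), rewards could also be counts, such as the number of fish caught in a patch. Here is a sketch using a Poisson distribution, with rateVec and countGenerator as hypothetical names:
# Hypothetical count-based variant: each option yields a Poisson-distributed number of successes
rateVec <- runif(n = K, min = 0, max = 5) #expected count (lambda) for each option; illustrative values
countGenerator <- function(k) { #k is an integer or a vector of integers, selecting one of the 1:K arms of the bandit
  payoff <- rpois(n = length(k), lambda = rateVec[k]) #Poisson-distributed count rewards
  return(payoff)
}
countGenerator(sample(1:K, 25, replace = TRUE))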
Live demonstrations
Let’s now dive in and experience learning and exploration for yourself!
Individual learning
We will start with a simple individual learning version of the task, where you’ll each be on your own. Think about how you learn which options provide the best rewards and how you navigate the explore-exploit dilemma.
The demo game is a 16-armed bandit task. Suppose you are on a fishing boat, trying to catch as many fish as possible. The ocean is divided into 16 grid cells, and some cells will provide more fish than others. To make a decision, just click/tap one of the cells.
Individual learning experiment
Exercise 1.2. At the end of the experiment, you can download the results. Once you have finished tutorial 2, try fitting a reinforcement learning model to your data and see which parameters describe it best. What potential limitations do you think there will be when fitting a model to only 10 trials of data?
Social learning problems
For both humans and animals, much of what we learn, we learn from others. Social information is cheap: we can expediently learn which actions lead to reward or to punishment by observing others, instead of blundering our way through individual trial and error (Heyes, 2002; Hoppitt & Laland, 2013). Social information also has incredible depth (at least for humans): we can extract complex inferences about the hidden goals, beliefs, and preferences of others from subtle patterns of behavior (Jara-Ettinger, 2019; Wu et al., 2022). Yet even simple social dynamics can give rise to intelligent group-level behavior (Berdahl et al., 2013), where problems that are challenging or impossible for a lone individual are no match for the power of a collective.
While researchers use a wide range of problems to study social learning and collective behavior, here we focus on providing code and intuitions for implementing the simplest possible social learning task: the multi-armed bandit problem. Later, we show how, through minor tweaks, we can modify the same basic setting to create a galaxy of different social learning problems.