Tutorial 1 - Designing a social learning environment

Charley Wu and Wataru Toyokawa

2023-07-01

packages <- c('tidyverse', "vembedr", 'formatR')
#invisible(lapply(packages, install.packages, character.only = TRUE)) #Install packages if any are missing
invisible(lapply(packages, require, character.only = TRUE)) #Load all packages
knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
set.seed(1234) #set seed for reproducibility

Welcome to tutorial 1 of COSMOS Konstanz. These code notebooks are designed to be self-contained learning tools for building on the concepts taught during the summer school, but should also be accessible for anyone from around the world interested in these topics.

The code notebooks will introduce theoretical concepts, provide basic mathematical formulas, and helpful code snippets that you can use in your own projects. There are also links to demonstrations of simple online experiments and interactive Shiny apps that will allow you to play around with different experimental paradigms, computational models, and simulation frameworks. These notebooks will also contain exercises, which offer suggestions for how to go beyond the materials.

Multi-armed bandit problem

Our first goal is to design a simple learning environment for studying social learning and collective behavior. We start with a multi-armed bandit problem (Robbins, 1952; Steyvers et al., 2009; Wald, 1947) as one of the simplest tasks commonly used to study learning in humans, animals, and artificial agents.

Multi-armed bandits are an analogy to a row of slot machines in a casino (and not an octopus who has taken to a life of petty crime) The term “multi-armed bandit” is simply a colorful metaphor for a row of slot machines in a casino. Since slot machines are notorious for stealing your money, they’re sometimes known as a “one-armed bandit”, in reference to the lever used to play the machine.

Each of the \(k \in [1,\ldots, K]\) slot machines refer to one of \(K\) different available decisions or options, each yielding a reward from an initially unknown payoff function. We can easily map this metaphor onto a wide range of real-world learning problems, such as choosing what to order at a restaurant, choosing where to forage for food, or choosing which crops to plant. The goal is thus to maximize your payoffs, by sequentially playing one of the options at each time \(t\).

In order to do so, one must balance between exploring new options to acquire information about payoffs (to inform future decisions) and exploiting options known to have high-payoffs (in order to acquire immediate rewards). This is known as the exploration-exploitation dilemma (Mehlhorn et al., 2015) and every living organism capable of learning must achieve an efficient balance to these dual goals. Yet, given a finite time horizon for learning, optimal solutions are generally unavailable (Aplin et al., 2015; Deffner et al., 2020; Sasaki et al., 2013).

2-armed Gaussian bandits

Let’s start from the simplest case and program a 2-armed bandit with Gaussian rewards. Imagine you are a farmer, and can choose to either plant one of two crops: wheat vs. potato. At the end of the growing season, you then earn a reward proportional to how well your crop performed. However, one of the crops may produce reliably better yields, and it is up to the farmer to figure out which crop is better.

The farmer may experience better yields or poorer yields depending on the choice of crops as well as some raddom noise from the environment.

Year after year, there may be variations in yield for a given crop, which we can model using Gaussian noise. Thus, we can describe the reward for option \(k\) as a sample from a Gaussian distribution (also known as a Normal distribution, hence the fancy \(\mathcal{N}\)):

\[ r_k \sim \mathcal{N}(\mu_k, \sigma_k^2) \]

We can now describe a payoff function for each option in terms of a mean or expected reward \(\mu_k\) and the variability \(\sigma_k^2\) (expressed as a variance). Note that the variability can also be expressed in terms of a standard deviation as \(\sigma_k = \sqrt{\sigma_k^2}\).

(Note: people using MacOS Ventura may have issues rendering the equations due to changes in the font library. If you see a box instead of a fancy N, then you can fix the issue using right-click > Math Settings > Math Renderer > “Common HTML”).

Generative model of the environment

A generative model of the environment takes a choice as an input and returns a payoff as an output.

Let’s now program a generative model of the bandit task. By defining the number of options, and providing the expected reward and variability for each option, we can define a function that takes an action and returns a reward. We provide the function in the code block below and generate payoffs for 25 random choices.

K = 2 #number of options
meanVec <- runif(n=K, min = -10, max = 10) #Payoff means, which we sample from uniform distribution between -10 to 10
sigmaVec <- rep(1, K) #Variability of payoffs (as a standard deviation), which we assume is the same for all options. 

banditGenerator <- function(k) {#k is an integer or k vector of integers, selecting one of the 1:K arms of the bandit
  payoff <- rnorm(n = length(k), mean = meanVec[k], sd = sigmaVec[k])
  return (payoff)
}

actionSequence <- sample(1:K, size = 25, replace = TRUE) # select 25 random actions
payoffs <- banditGenerator(actionSequence)

df <- data.frame(action = actionSequence, payoff = payoffs)
knitr::kable(df)

action	payoff
2	3.8159985
2	0.7586613
1	-8.3533680
2	2.4643047
1	-7.0206883
1	-8.3729508
1	-6.8577509
2	2.8216237
2	2.7562503
2	2.4509950
2	2.4083578
1	-7.0019557
2	1.9492492
2	2.4573833
2	2.4558480
1	-7.0476604
2	3.4755511
2	0.7164596
2	0.2416400
2	2.9891610
2	2.4674153
2	2.6085539
2	3.6877424
2	3.2287656
1	-7.6778111

Let’s now visualize the environment, where we use a large number of samples (n=10000) to paint an accurate picture of the true generative reward structure (vertical density plot), and plot that against a small number (n=25) of sampled payoffs (dots):

source('codeSnippets/geom_flat_violin.R') 
# Use samples to approximate the reward distribution
sampleDF <- data.frame()
nsamples = 10000 #Samples used to approximate the normal distribution
plotsamples = 25 #samples to plot as dots

rewardSamples <- c(sapply(1:K, FUN=function(k) banditGenerator(rep(k, nsamples))))#use an apply function to simulate multiple samples of each arm
sampleDF <- data.frame(option = rep(1:K, each = nsamples), payoff=rewardSamples, nsample = rep(1:nsamples, K)) #create a dataframe for plotting

ggplot(sampleDF, aes(x = factor(option), y = payoff, fill = factor(option)))+
  geom_flat_violin(position = position_nudge(x = 0.3, y = 0), alpha = .5) +
  geom_jitter(data = subset(sampleDF, nsample<=plotsamples), aes(y = payoff, color = factor(option)),width = 0.2 )+  #Plot only a subset of points show how well a limited number of samples approximates the true distribution
  ylab('Payoff')+
  xlab('Option')+
  theme_classic() +
  scale_fill_viridis_d()+
  scale_color_viridis_d()+
  ggtitle('Payoff conditions')+
  theme(legend.position ='none')

Scaling up

Of course, a 2-armed bandit is quite a simple learning environment. But we can very easily scale up the number of options. How many trials do you think it would take you to identify the best option below?

#define larger set of option
K <- 10
meanVec <- runif(n=K, min = -10, max = 10) #Payoff means, which we sample from uniform distribution between -10 to 10
sigmaVec <- rep(1, K)

# Use samples to approximate the reward distribution
sampleDF <- data.frame()
nsamples = 10000 #Samples used to approximate the normal distribution
plotsamples = 25 #samples to plot as dots

rewardSamples <- c(sapply(1:K, FUN=function(k) banditGenerator(rep(k, nsamples))))#use an apply function to simulate multiple samples of each arm
sampleDF <- data.frame(option = rep(1:K, each = nsamples), payoff=rewardSamples, nsample = rep(1:nsamples, K)) #create a dataframe for plotting

ggplot(sampleDF, aes(x = factor(option), y = payoff, fill = factor(option)))+
  geom_flat_violin(position = position_nudge(x = 0.3, y = 0), alpha = .5) +
  geom_jitter(data = subset(sampleDF, nsample<=plotsamples), aes(y = payoff, color = factor(option)),width = 0.2 )+  #Plot only a subset of points show how well a limited number of samples approximates the true distribution
  ylab('Payoff')+
  xlab('Option')+
  theme_classic() +
  scale_fill_viridis_d()+
  scale_color_viridis_d()+
  ggtitle('Payoff conditions')+
  theme(legend.position ='none')

Exercise 1.1. Can you modify the payoff generating function to describe other tasks? Sometimes rewards are binary, such as with success or failure outcomes when publishing a paper or hunting for food. With binary reward structures, you can use Bernoulli distribution in place of a Gaussian distribution. The code snippet below contains a brief example.

K <- 10
successVec = runif(K) #different success probabilities for each option sampled from a uniform distribution between 0 and 1
binaryGenerator <- function(k) {#k is an integer or k vector of integers, selecting one of the 1:K arms of the bandit
  payoff <- rbernoulli(n = length(k), p=successVec[k]) #a Bernoulli distribution can be used to describe a binary reward structure
  return (payoff)
}
binaryGenerator(sample(1:10, 25, replace = TRUE))

Live demonstrations

Let’s now dive in and experience learning and exploration for yourself!

Individual learning

We will start with a simple individual learning version of the task, where you’ll each be on your own. Think about how you learn which options provide the best rewards and how you navigate the explore-exploit dilemma.

A 16-armed bandit task.

The demo game is an 16-armed bandit task. Suppose you are in the fishing boat and trying to catch as many fish as possible. The ocean is divided into the 16 grid cells. Some grid areas will provide more fish than other grid areas. To make a decision, just click/tap one of the cells.

Individual learning experiment

Exercise 1.2. At the end of the experiment, you can download the results. Once you have finished tutorial 2, try fitting a reinforcement learning model on your data and see which parameters best describe your data. What potential limitations do you think there will be with fitting a model on 10 trials of data?

A galaxy of social learning problems

While we have focused on a very simple multi-armed bandit task, it is also offers a high-degree of versatility. We have hinted at several tweaks to the paradigm in some exercises above, but we will now draw some more concrete connections to other domains for studying social learning.

A galaxy of social learning problems.

Spatial structure

Many social interactions unfold in spatial settings, where both social information (Sridhar et al., 2021) and rewards are defined by spatial structure (Hawkins et al., in press). For instance, high-valued rewards may tend to be clustered together, such like good restaurants in a hip part of a city or prey/vegetation naturally occurring in patchy distributions.

Spatially correlated bandit task from Wu et al. (2018)

One paradigm that adds spatial structure to bandit tasks is the “spatially correlated bandit” (Wu et al., 2018). Whereas bandit tasks typically assume that the rewards for each option are completely independent from all other options, the spatially correlated bandit places up to 121 options onto a 2D grid and informs participants options tend to have similar rewards to other nearby options. The spatial structure of rewards can be used (by people as young as age 5: Giron et al., in press) to guide exploration towards promising regions of the search space. With the following link, you can try a version of the task where the rewards were derived from the yield of various crops on an experimental farm, such that the rows and columns of the grid correspond to the rows and columns of a field (see Exp. 3 in Wu et al., 2018 for details):

More recently, Wu et al. (2023) developed an immersive variant of the task with collective foraging dynamics. Participants are tasked with foraging for binary rewards (indicated by a blue splash), and in the condition below, rewards are clustered together. This means that observing another player acquire a reward provides a valuable social signal that there are likely to be other rewards nearby:

Task tutorial

embed_url("https://www.youtube.com/watch?v=QksKYOoElxg")

POV video from social round with smooth rewards

embed_url("https://www.youtube.com/watch?v=wyk7RbmHiok")

Foraging tasks

Much research on social learning has used the naturalistic context of foraging, where an organism can use social cues to help find food or other important resources (Giraldeau & Caraco, 2018; Hoppitt & Laland, 2013; Zentall et al., 1988). Foraging tasks can be framed as a special kind of multi-armed bandit, where the dynamics of the natural world mean that rewards are typically finite and diminish over time. Just as bandits simply decision-making problems into a discrete number of choices, research on foraging also simplifies the natural world into a discrete number of patches. For instance, in this illustration of a producer-scrounger task (Barnard & Sibly, 1981), experimenters place food pellets in a limited number of wells on a grid and observe how animals engage in either individual learning (“producer” strategy) or social imitation (“scrounger” strategy). We will be discussing the dynamics of different learning strategies in more detail in Tutorial 2.

Producer-Scrounger task from Keynan et al. (2015)

Exercise 1.5. Can you modify the bandit generator to capture the dynamics of a foraging task and produce diminishing returns? For instance, each time an option is chosen, the expected rewards could decays towards 0. This might mimic a real “foraging” situation where there are only limited amounts of food in each patch. You may need to use a different reward distribution than a Gaussian, since rewards are typically non-negative (but not always, since some resources can inflict harm; i.e., eating rotten food).

Innovation and cumulative culture

Beyond only foraging and learning in spatial environments, we also want to describe social learning problems in more open-ended environments. We may want to model how people or animals come up with new innovations, such as new behaviors (Miu et al., 2018) or new tools (Mesoudi & O’Brien, 2008). For example, in the arrow head task by Mesoudi & O’Brien (2008), participants were tasked with learning which combination of height, width, shape, color, and thickness produced the best hunting outcomes.

from Mesoudi & O’Brien (2008)

One method for describing learning along an arbitrary number of abstract dimensions is a fitness landscape. Originally introduced in the 1930s to study genetic evolution, fitness landscapes can also be used to map any combination of latent features to some fitness value. Learning, can thus be compared to an optimization process as an organism learns to acquire better solutions through a combination of both individual and social learning (Barkoczi et al., 2016; Mason & Watts, 2012).

from De Visser & Krug (2014)

This is still quite similar to how the spatially correlated bandit (Wu et al., 2018) maps each of the options onto a 2D grid. The main difference is that a fitness landscape introduces an arbitrary number of dimensions that may have abstract rather than spatial relationships between options or “locations” on the search space. And while there are many similarities in how people learn and search in spatial vs. non-spatial environments, there are also important differences (Wu et al., 2020). For instance, compared to learning spatially correlated rewards, Wu et al. (2020) found that participants explored in a distinctly different fashion when learning on the exact same reward structure, but the with x- and y-dimensions mapped instead onto the tilt and stripe frequency of a Gabor patch.

from Wu et al. (2020)

Beyond bandits

Of course, bandit tasks cannot capture all interesting learning problems (Brändle et al., 2022). But it is incredibly versatile and using similar code to what have provided above, you can describe a wide-range of interesting social learning problems.

References

Aplin, L. M., Farine, D. R., Morand-Ferron, J., Cockburn, A., Thornton, A., & Sheldon, B. C. (2015). Experimentally induced innovations lead to persistent culture via conformity in wild birds. Nature, 518(7540), 538–541.

Barkoczi, D., Analytis, P. P., & Wu, C. M. (2016). Collective search on rugged landscapes: A crossenvironmental analysis. In A. Papafragou, D. Grodner, D. Mirman, & J. C. Trueswell (Eds.), Proceedings of the 38th Annual Conference of the Cognitive Science Society (pp. 918–923). Cognitive Science Society.

Barnard, C. J., & Sibly, R. M. (1981). Producers and scroungers: A general model and its application to captive flocks of house sparrows. Animal Behaviour, 29(2), 543–550.

Berdahl, A., Torney, C. J., Ioannou, C. C., Faria, J. J., & Couzin, I. D. (2013). Emergent sensing of complex environments by mobile animal groups. Science, 339(6119), 574–576.

Brändle, F., Binz, M., & Schulz, E. (2022). Exploration beyond bandits. In I. Cogliati Dezza, E. Schulz, & C. M. Wu (Eds.), The drive for knowledge: The science of human information seeking (pp. 147–168). Cambridge University Press. https://doi.org/10.1017/9781009026949.008

De Visser, J., & Krug, J. (2014). Empirical fitness landscapes and the predictability of evolution. Nature Reviews Genetics, 15(7), 480–490.

Deffner, D., Kleinow, V., & McElreath, R. (2020). Dynamic social learning in temporally and spatially variable environments. Royal Society Open Science, 7(12), 200734.

Giraldeau, L.-A., & Caraco, T. (2018). Social foraging theory. In Social foraging theory. Princeton University Press.

Giron, A. P., Ciranka, S., Schulz, E., Bos, W. van den, Ruggeri, A., Meder, B., & Wu, C. M. (in press). Developmental changes resemble stochastic optimization. Nature Human Behaviour. https://doi.org/10.31234/osf.io/9f4k3

Hawkins, R. D., Berdahl, A. M., Pentland, A., Tenenbaum, J. B., Goodman, N. D., Krafft, P., et al. (in press). Flexible social inference facilitates targeted social learning when rewards are not observable. Nature Human Behaviour.

Heyes, C. (2002). Transformational and associative theories of imitation. Imitation in Animals and Artifacts, 501.

Hoppitt, W., & Laland, K. N. (2013). Social learning. In Social learning. Princeton University Press.

Jara-Ettinger, J. (2019). Theory of mind as inverse reinforcement learning. Current Opinion in Behavioral Sciences, 29, 105–110.

Keynan, O., Ridley, A. R., & Lotem, A. (2015). Social foraging strategies and acquisition of novel foraging skills in cooperatively breeding arabian babblers. Behavioral Ecology, 26(1), 207–214.

Mason, W., & Watts, D. J. (2012). Collaborative learning in networks. Proceedings of the National Academy of Sciences, 109(3), 764–769.

Mehlhorn, K., Newell, B. R., Todd, P. M., Lee, M. D., Morgan, K., Braithwaite, V. A., Hausmann, D., Fiedler, K., & Gonzalez, C. (2015). Unpacking the exploration–exploitation tradeoff: A synthesis of human and animal literatures. Decision, 2, 191–215.

Mesoudi, A., & O’Brien, M. J. (2008). The cultural transmission of great basin projectile-point technology i: An experimental simulation. American Antiquity, 73(1), 3–28.

Miu, E., Gulley, N., Laland, K. N., & Rendell, L. (2018). Innovation and cumulative culture through tweaks and leaps in online programming contests. Nature Communications, 9(1), 1–8.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58, 527–535.

Sasaki, T., Granovskiy, B., Mann, R. P., Sumpter, D. J., & Pratt, S. C. (2013). Ant colonies outperform individuals when a sensory discrimination task is difficult but not when it is easy. Proceedings of the National Academy of Sciences, 110(34), 13769–13773.

Sridhar, V. H., Li, L., Gorbonos, D., Nagy, M., Schell, B. R., Sorochkin, T., Gov, N. S., & Couzin, I. D. (2021). The geometry of decision-making in individuals and collectives. Proceedings of the National Academy of Sciences, 118(50), e2102157118.

Steyvers, M., Lee, M. D., & Wagenmakers, E.-J. (2009). A Bayesian analysis of human decision-making on bandit problems. Journal of Mathematical Psychology, 53, 168–179.

Wald, A. (1947). Sequential analysis. Wiley, New York.

Wu, C. M., Deffner, D., Kahl, B., Meder, B., Ho, M. H., & Kurvers, R. H. (2023). Visual-spatial dynamics drive adaptive social learning in immersive environments. bioRxiv. https://doi.org/10.1101/2023.06.28.546887

Wu, C. M., Schulz, E., Garvert, M. M., Meder, B., & Schuck, N. W. (2020). Similarities and differences in spatial and non-spatial cognitive maps. PLOS Computational Biology, 16, 1–28. https://doi.org/10.1371/journal.pcbi.1008149

Wu, C. M., Schulz, E., Speekenbrink, M., Nelson, J. D., & Meder, B. (2018). Generalization guides human exploration in vast decision spaces. Nature Human Behaviour, 2, 915--924. https://doi.org/10.1038/s41562-018-0467-4

Wu, C. M., Vélez, N., & Cushman, F. A. (2022). Representational exchange in human social learning: Balancing efficiency and flexibility. In I. C. Dezza, E. Schulz, & C. M. Wu (Eds.), The Drive for Knowledge: The Science of Human Information-Seeking. Cambridge University Press.

Zentall, T. R., Galef, B. G., et al. (1988). Social learning: Psychological and biological perspectives. Psychology Press.