Optimal design of experiments to identify latent behavioral types

Stefano Balietti, Brennan Klein, Christoph Riedl

TL;DR

This paper tackles the challenge of efficiently designing Bayesian experiments to distinguish latent behavioral types. It introduces two computational innovations that sharply reduce the cost of finding informative experimental designs: Gaussian Process Upper Confidence Bound-Pure Exploration (GPUCB-PE) for adaptive search, and Parameter-Sampled dataset evaluation. Applied to a Stop-Go imperfect-information game, the approach discriminates among competing behavioral models with less data and finds that Roth-Erev reinforcement learning explains human decisions better than the Bayesian Nash equilibrium. The authors demonstrate substantial computational gains, show that experts' predictions are often suboptimal, and argue that the framework can be integrated into online experimentation to rapidly test and iterate on behavioral hypotheses.

Abstract

Bayesian optimal experiments that maximize the information gained from collected data are critical to efficiently identify behavioral models. We extend a seminal method for designing Bayesian optimal experiments by introducing two computational improvements that make the procedure tractable: (1) a search algorithm from artificial intelligence that efficiently explores the space of possible design parameters, and (2) a sampling procedure that evaluates each design parameter combination more efficiently. We apply our procedure to a game of imperfect information to evaluate and quantify the computational improvements. We then collect data across five different experimental designs to compare the ability of the optimal experimental design to discriminate among competing behavioral models against the experimental designs chosen by a "wisdom of experts" prediction experiment. We find that the experiment suggested by the optimal design approach requires significantly less data to distinguish behavioral models (i.e., test hypotheses) than the experiment suggested by experts. Substantively, we find that reinforcement learning best explains human decision-making in the imperfect information game and that behavior is not adequately described by the Bayesian Nash equilibrium. Our procedure is general and computationally efficient and can be applied to dynamically optimize online experiments.

Paper Structure

This paper contains 40 sections, 3 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Illustration of lowly- vs. highly-informative experiments. Models 1, 2, and 3 assign likelihoods to the data generated in two hypothetical experiments: A and B. Experiment A does not allow researchers to say with certainty which model fits the data best; in other words, the design of Experiment A generates data that contains "Low Information." On the other hand, Experiment B has a clear winner: Model 1 predicted the observed data as highly likely, while Models 2 and 3 did not. Therefore, the data generated by Experiment B contains "High Information." The optimal experimental design procedure outlined in this paper returns values for experimental parameters that are expected to maximally distinguish the likelihoods of the competing models, as in Experiment B.
  • Figure 2: Extensive-form representation of the game. If Player 1 chooses Go, Player 2 must make a choice between Left and Right. Possible payoffs of the game are $(0, 1, 2, A)$ with $A > 2$. The parameters to optimize are $A$ and $\pi$.
  • Figure 3: Information surfaces for experimental design. (a) Information surface originally presented in El-Gamal & Palfrey (1996); (b) information surface generated by replicating the optimal design procedure in El-Gamal & Palfrey (1996); (c) information surface generated using Parameter-Sampled GPUCB-PE, where the points shown represent coordinates searched by the algorithm and the green star marks the experiment that is predicted to produce the maximum information gain. Note: as described in El-Gamal & Palfrey (1996), the experiment with the maximum information is expected to occur when $A$ reaches its minimum value, which approaches (but is not equivalent to) $A=2.0$.
  • Figure 4: Comparison of algorithm performance measured by regret. Here we compare the performance of three different search methods for finding the optimal experiment: Random search, Grid search, and GPUCB-PE. We plot the mean regret over time for each method (averaged over 200 runs), where regret is defined as the difference between the maximum value uncovered by the algorithm and the average global maximum. We show the 95% confidence interval with translucent bands around the curves (error bars for Grid Search). For each search method, we mimic the steps of our Parameter-Sampled GPUCB-PE algorithm. That is, we sample datasets instead of exhaustively searching the parameter space: for every experimental parameter that the algorithm selects (i.e., for every coordinate searched on the information surface), 10,000 model parameters are sampled, and a synthetic dataset is generated using these sampled experimental and model parameters, for each of the three models being compared. These sampled datasets are then used to assign an information value to each coordinate in the information surface. The Parameter-Sampled GPUCB-PE algorithm finds a solution close to the global maximum almost immediately, whereas Random and Grid Search require more searches in order to find a value close to the global maximum.
  • Figure 5: Expert predictions and experiments chosen. (a) Raw data from the expert survey, showing the modal prediction of $A = 6.0$ and $\pi=0.5$ and, more importantly, the noise and uncertainty in the respondents' predictions (data shown with small jitter to avoid overplotting); (b) the five points in the experimental design space that were ultimately tested on Amazon Mechanical Turk. Note: the experiment with the highest information gain occurs at $\pi=0.5$ as $A$ approaches $2.0$. We report results from the design $(\pi=0.5, A=2.0)$ because human participants would be unable to distinguish $A=2.000$ from $A=2.001$, for example.
  • ...and 11 more figures
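
The regret measure described in the Figure 4 caption can be illustrated with a minimal sketch. It covers only the Random and Grid search baselines; GPUCB-PE itself is not implemented here. The function `toy_information`, its peak at $(A=2.0, \pi=0.5)$, the search ranges, and all helper names are hypothetical stand-ins chosen for illustration, not the information surface or code from the paper.

```python
import random


def toy_information(a, pi):
    """Hypothetical information surface (an assumption, not the paper's).

    It has a single peak of height 1.0 at a = 2.0, pi = 0.5.
    """
    return 1.0 / (1.0 + (a - 2.0) ** 2 + 4.0 * (pi - 0.5) ** 2)


def random_search(n_evals, rng):
    """Evaluate n_evals uniformly random designs; return the running best value."""
    best, trace = float("-inf"), []
    for _ in range(n_evals):
        a = rng.uniform(2.0, 6.0)   # assumed range for A
        pi = rng.uniform(0.0, 1.0)  # assumed range for pi
        best = max(best, toy_information(a, pi))
        trace.append(best)
    return trace


def grid_search(n_per_axis):
    """Evaluate an evenly spaced grid row by row; return the running best value."""
    best, trace = float("-inf"), []
    for i in range(n_per_axis):
        for j in range(n_per_axis):
            a = 2.0 + 4.0 * i / (n_per_axis - 1)
            pi = j / (n_per_axis - 1)
            best = max(best, toy_information(a, pi))
            trace.append(best)
    return trace


def regret(trace, global_max):
    """Regret after each evaluation: global maximum minus best value found so far."""
    return [global_max - best for best in trace]


def mean_random_regret(n_runs, n_evals, global_max, seed=0):
    """Average the random-search regret curve over n_runs independent runs."""
    rng = random.Random(seed)
    totals = [0.0] * n_evals
    for _ in range(n_runs):
        run = regret(random_search(n_evals, rng), global_max)
        for t, value in enumerate(run):
            totals[t] += value
    return [total / n_runs for total in totals]


GLOBAL_MAX = 1.0  # known by construction for the toy surface
random_curve = mean_random_regret(n_runs=200, n_evals=25, global_max=GLOBAL_MAX)
grid_curve = regret(grid_search(n_per_axis=5), GLOBAL_MAX)
```

Both curves are non-increasing because regret is computed from the running best value, so a search method is judged by how quickly its curve approaches zero. In this toy setup the grid happens to contain the peak, which would not generally hold for a real information surface.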