Program-Based Strategy Induction for Reinforcement Learning

Carlos G. Correa, Thomas L. Griffiths, Nathaniel D. Daw

TL;DR

The paper tackles the gap between traditional incremental reinforcement-learning models and the discrete, heuristic strategies observed in humans and animals. It introduces Bayesian program induction to infer program-structured strategies that balance simplicity and effectiveness, via a prior over programs and a likelihood linked to the value of a strategy $V(\pi)$ in a given task. Applying this framework to several bandit tasks reveals interpretable strategies such as WSLS-like rules, reward accumulators, horizon-aware exploration, and discrete decision states, offering a resource-rational explanation for adaptive behavior. The approach yields a modular, interpretable account of strategy induction with potential for extension to planning and behavior analysis, providing an alternative to opaque neural network-based strategy discovery.

Abstract

Typical models of learning assume incremental estimation of continuously-varying decision variables like expected rewards. However, this class of models fails to capture more idiosyncratic, discrete heuristics and strategies that people and animals appear to exhibit. Despite recent advances in strategy discovery using tools like recurrent networks that generalize the classic models, the resulting strategies are often onerous to interpret, making connections to cognition difficult to establish. We use Bayesian program induction to discover strategies implemented by programs, letting the simplicity of strategies trade off against their effectiveness. Focusing on bandit tasks, we find strategies that are difficult or unexpected with classical incremental learning, like asymmetric learning from rewarded and unrewarded trials, adaptive horizon-dependent random exploration, and discrete state switching.

Paper Structure

This paper contains 8 sections, 5 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Strategies for the two-armed bandit with stationary, Bernoulli rewards. a) The Pareto frontier of deterministic strategies, which are maximal points in the space defined by the normalized value and prior. Points correspond to individual strategies. One stochastic strategy described in the text is included (unfilled point) for comparison. Value is normalized so that chance behavior has a value of 0 and behavior guided by an oracle has a value of 1. Also shown is the Bayes-optimal solution to the task (dotted line). Example strategies are shown in b-e), marked in a), and described in the text.
  • Figure 2: Adaptive random exploration, using a stochastic accumulator. a) The Pareto frontier for solutions in the long horizon condition. Many solutions simply accumulate rewards, but increase the determinism of policies (less probable under prior) to achieve greater value. b) The discovered accumulator. The "$\blacksquare$" is a placeholder for the inverse temperature. c) Replotted from Fig. 2b in Wilson et al. (2014). Across conditions, there is a horizon-specific adjustment of decision noise. d) The optimal horizon-specific inverse temperature for the stochastic accumulator leads to more random behavior for the long horizon condition. Horizon-specific inverse temperature was selected to maximize Eq. \ref{eq:posterior} with $\beta=300$. Rewards were scaled by $\frac{1}{100}$. e) Replotted from Fig. 3a in Wilson et al. (2014). In the long horizon condition, there is horizon-specific adjustment of decision noise. f) Memory magnitude in the accumulator grows over time, resulting in less randomness for later trials in the long horizon condition. Same inverse temperature as d).
  • Figure 3: Discovering strategies with discrete decision states for a bandit task with non-stationary reward. a) The state-based choice model in Ebitz et al. (2018), featuring distinct states where behavior is either more exploratory or exploitative. Adapted from Fig. 1c in Ebitz et al. (2018). b) A WSLS strategy that randomly samples from actions after a loss. c) A state machine for the WSLS strategy. States correspond to distinct action distributions. Edge color indicates action, dashed edges correspond to losses, and edges with action probability of less than 1% were excluded. The initial conditions of the strategy were modified to simplify the state machine. d) A more complex strategy that exploits after a single win, but requires consecutive losses to switch. e) State machine for d). Wins always lead to a state at top (as indicated), so wins are excluded elsewhere. States with 4+ consecutive losses (at bottom) were collapsed because they have similar action probabilities and identical transitions.
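
Two of the strategy families named in the captions above can be illustrated with a minimal Python sketch. This is not the paper's program representation or its discovered code: the function names, the per-action memory layout, and the Bernoulli bandit simulator are all assumptions made for illustration. The first function mimics a WSLS rule that samples uniformly after a loss (Figure 3b); the second pair mimics a stochastic reward accumulator whose choices pass through a softmax with an inverse temperature (Figure 2b).

```python
import math
import random


def wsls_random_after_loss(prev_action, prev_reward, n_actions, rng):
    """Win-stay; after a loss (or on the first trial) sample uniformly.

    Hypothetical helper, not taken from the paper.
    """
    if prev_action is not None and prev_reward == 1:
        return prev_action
    return rng.randrange(n_actions)


def softmax_choice_probs(memory, inverse_temperature):
    """Softmax over accumulated reward; larger inverse temperature = less random."""
    scaled = [inverse_temperature * m for m in memory]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_wsls(reward_probs, n_trials, seed=0):
    """Average reward of the WSLS sketch on a stationary Bernoulli bandit."""
    rng = random.Random(seed)
    action, reward, total = None, 0, 0
    for _ in range(n_trials):
        action = wsls_random_after_loss(action, reward, len(reward_probs), rng)
        reward = 1 if rng.random() < reward_probs[action] else 0
        total += reward
    return total / n_trials


def simulate_accumulator(reward_probs, n_trials, inverse_temperature, seed=0):
    """Average reward of a softmax accumulator (assumed: one sum per action)."""
    rng = random.Random(seed)
    memory = [0.0] * len(reward_probs)
    total = 0
    for _ in range(n_trials):
        probs = softmax_choice_probs(memory, inverse_temperature)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = 1 if rng.random() < reward_probs[action] else 0
        memory[action] += reward  # accumulate reward, never decay
        total += reward
    return total / n_trials


if __name__ == "__main__":
    print(simulate_wsls([0.8, 0.2], n_trials=1000))
    print(simulate_accumulator([0.8, 0.2], n_trials=1000, inverse_temperature=1.0))
```

Because the accumulated sums in this sketch only grow, the gap between actions tends to widen over trials, so the softmax becomes less random later in a run even with a fixed inverse temperature. That is the qualitative effect the Figure 2f caption describes, though the paper's accumulator may store its memory differently.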