A Survey of Reinforcement Learning For Economics

Pranjal Rawat

TL;DR

This survey (re)introduces reinforcement learning methods to economists and examines the practical vulnerabilities of these algorithms, noting their brittleness, sample inefficiency, sensitivity to hyperparameters, and the absence of global convergence guarantees outside of tabular settings.

Abstract

This survey (re)introduces reinforcement learning methods to economists. The curse of dimensionality limits how far exact dynamic programming can be effectively applied, forcing us to rely on suitably "small" problems or our ability to convert "big" problems into smaller ones. While this reduction has been sufficient for many classical applications, a growing class of economic models resists such reduction. Reinforcement learning algorithms offer a natural, sample-based extension of dynamic programming, extending tractability to problems with high-dimensional states, continuous actions, and strategic interactions. I review the theory connecting classical planning to modern learning algorithms and demonstrate their mechanics through simulated examples in pricing, inventory control, strategic games, and preference elicitation. I also examine the practical vulnerabilities of these algorithms, noting their brittleness, sample inefficiency, sensitivity to hyperparameters, and the absence of global convergence guarantees outside of tabular settings. The successes of reinforcement learning remain strictly bounded by these constraints, as well as a reliance on accurate simulators. When guided by economic structure, reinforcement learning provides a remarkably flexible framework. It stands as an imperfect, but promising, addition to the computational economist's toolkit. A companion survey (Rust and Rawat, 2026b) covers the inverse problem of inferring preferences from observed behavior.
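To make the phrase "sample-based extension of dynamic programming" concrete, the following minimal sketch compares exact value iteration with tabular Q-learning on a toy problem. The two-state MDP, its rewards, the discount factor `GAMMA = 0.5`, the uniform exploration rule, and the step-size schedule are all illustrative assumptions, not taken from the survey.

```python
import random

# Hypothetical 2-state, 2-action MDP (illustrative numbers, not from the survey).
# P[s][a] is a list of (probability, next_state); R[s][a] is the immediate reward.
P = {
    0: {0: [(0.8, 0), (0.2, 1)], 1: [(0.3, 0), (0.7, 1)]},
    1: {0: [(0.5, 0), (0.5, 1)], 1: [(0.1, 0), (0.9, 1)]},
}
R = {0: {0: 1.0, 1: 0.0}, 1: {0: 2.0, 1: 3.0}}
GAMMA = 0.5  # assumed discount factor, chosen small so learning is fast
STATES, ACTIONS = (0, 1), (0, 1)


def value_iteration(tol=1e-10):
    """Exact dynamic programming: uses the full transition model P."""
    Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    while True:
        Q_new = {
            s: {
                a: R[s][a]
                + GAMMA * sum(p * max(Q[s2].values()) for p, s2 in P[s][a])
                for a in ACTIONS
            }
            for s in STATES
        }
        diff = max(abs(Q_new[s][a] - Q[s][a]) for s in STATES for a in ACTIONS)
        Q = Q_new
        if diff < tol:
            return Q


def q_learning(steps=200_000, seed=0):
    """Sample-based analogue: only sees sampled transitions (s, a, r, s')."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    visits = {s: {a: 0 for a in ACTIONS} for s in STATES}
    s = 0
    for _ in range(steps):
        a = rng.choice(ACTIONS)  # uniform exploration (assumed behavior policy)
        probs, nexts = zip(*P[s][a])
        s2 = rng.choices(nexts, weights=probs)[0]  # simulator draw
        visits[s][a] += 1
        alpha = visits[s][a] ** -0.7  # decaying step size (assumed schedule)
        target = R[s][a] + GAMMA * max(Q[s2].values())
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2
    return Q
```

Calling `value_iteration()` and `q_learning()` and comparing the two tables entry by entry shows the learned values approaching the exact ones, even though the learner never reads `P` directly and only draws transitions from it.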

Paper Structure (106 sections, 7 theorems, 82 equations, 9 figures, 18 tables)

Introduction
A Brief History of Reinforcement Learning
Animal Psychology
Board Games
Optimal Control
Reinforcement Learning Algorithms
The Classical Synthesis
Monte Carlo Estimation
Sutton (1988)
Watkins (1989)
Williams (1992)
Tesauro (1994)
SARSA (1994)
Baird (1995)
Actor-Critic Methods (2000)
...and 91 more sections

Key Result

Theorem 1

Under regularity conditions, the linear semi-gradient estimator $\hat{h}$ satisfies $\|\hat{h} - h\|_2 = O_P(n^{-1/2}(T-1)^{-1/2})$, where $\|\cdot\|_2$ denotes the $L^2(P)$ norm. The $L^2(P)$ norm is $\|f\|_2 = (\int f(x)^2 \, dP(x))^{1/2}$, measuring average squared deviation under the probability ...

Figures (9)

Figure 1: The four phases of a single MCTS simulation in AlphaGo Zero. (a) Selection traverses the tree from the root, choosing at each node the action maximizing a UCB-like score balancing exploitation ($Q$) and exploration ($P/N$). (b) Expansion adds a new leaf node when the traversal reaches an unexplored position. (c) The neural network $f_\theta$ evaluates the new position, producing move priors $\mathbf{p}$ and a value estimate $v$. (d) Backup propagates $v$ along the traversed path, updating mean values $Q(s,a)$ and visit counts $N(s,a)$ at each edge.
Figure 2: The Brock--Mirman economy ($\alpha=0.36$, $\beta=0.96$, 1,000 states). (a) Value iteration on a scalar Bellman equation: the staircase iterates $V_{k+1} = TV_k$, converging at the linear rate $\gamma$. (b) Policy iteration as Newton's method: each step solves for the fixed point of the active policy operator $T^{\pi_k}$, jumping to the tangent line's intersection with the diagonal. (c) Sup-norm error $\|V_k - V^*\|_\infty$ for the discretized model; VI requires 567 iterations, PI converges in 11.
Figure 3: Learned value function $V(s)$ at percentage-based checkpoints of each algorithm's $V^*$-convergence time $T$, defined as the first episode where $\max_s |V(s) - V^*(s)| < 0.1$; algorithms that never converge use $T = 500{,}000$. All columns share the same color scale, so a converged algorithm's grid visually matches the $V^*$ reference column. Off-policy methods (Q-learning, Q($\lambda$), DQN) converge to $V^*$ everywhere. On-policy methods (SARSA, REINFORCE, NPG, PPO) show persistent discrepancies at off-path states even at the final episode.
Figure 4: Policy grids at percentage-based checkpoints of $V^*$-convergence time $T$ (same definition as Figure 3). Arrows indicate the greedy action at each state; dots indicate the stay action. Off-policy methods converge to $\pi^*$ everywhere. On-policy methods retain incorrect actions at states far from the optimal path.
Figure 5: Bus engine replacement benchmark. Left: computation time vs. fleet size (log scale). Right: discounted return vs. fleet size for DP, DQN, and heuristic baselines.
...and 4 more figures
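
The value iteration versus policy iteration comparison described in the Figure 2 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's code: it assumes log utility and full depreciation (the textbook Brock--Mirman specification), uses a coarse 60-point capital grid instead of the 1,000 states in the caption, and evaluates each policy by fixed-point iteration rather than a linear solve. Iteration counts will therefore differ from the 567 and 11 reported in the caption.

```python
import math

ALPHA, BETA = 0.36, 0.96  # parameters quoted in the Figure 2 caption
N = 60                    # assumption: coarse grid (the caption uses 1,000 states)

# Capital grid centered on the deterministic steady state (assumed grid bounds).
k_ss = (ALPHA * BETA) ** (1.0 / (1.0 - ALPHA))
grid = [k_ss * (0.2 + 1.6 * i / (N - 1)) for i in range(N)]


def reward(i, j):
    """Log utility of consumption c = k_i^ALPHA - k_j; -inf if infeasible."""
    c = grid[i] ** ALPHA - grid[j]
    return math.log(c) if c > 0 else -math.inf


U = [[reward(i, j) for j in range(N)] for i in range(N)]


def greedy(V):
    """Index of the best next-period capital at every grid point."""
    return [max(range(N), key=lambda j: U[i][j] + BETA * V[j]) for i in range(N)]


def value_iteration(tol=1e-8):
    """Repeatedly apply the Bellman operator until the sup-norm change is small."""
    V, iters = [0.0] * N, 0
    while True:
        V_new = [max(U[i][j] + BETA * V[j] for j in range(N)) for i in range(N)]
        iters += 1
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new, iters
        V = V_new


def evaluate(policy, tol=1e-10):
    """Value of a fixed policy, found by iterating V = u_pi + BETA * V[policy]."""
    V = [0.0] * N
    while True:
        V_new = [U[i][policy[i]] + BETA * V[policy[i]] for i in range(N)]
        if max(abs(a - b) for a, b in zip(V_new, V)) < tol:
            return V_new
        V = V_new


def policy_iteration(max_iters=100):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = greedy([0.0] * N)
    for iters in range(1, max_iters + 1):
        V = evaluate(policy)
        new_policy = greedy(V)
        if new_policy == policy:
            break
        policy = new_policy
    return V, policy, iters
```

Running `value_iteration()` and `policy_iteration()` on this grid should show the qualitative pattern in panel (c): both reach nearly the same value function, but policy iteration needs far fewer outer iterations because each step jumps to the fixed point of the current policy's operator.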

Theorems & Definitions (9)

Theorem 1: AdusumilliEckardt2022, Theorem 1
Theorem 2: AdusumilliEckardt2022, Theorem 5
Theorem 3: BreroEtAl2021, Propositions 1--4, informal
Theorem 4: Theorem 1 of Liu2024strategic
Theorem 5: Theorem 2 of Liu2024strategic
Definition D1: Confounded MDP (zhang2020causal)
Definition D2: Causal Bellman Operator (zhang2020causal)
Lemma L1: Bias of Naive Off-Policy Evaluation (kallus2020confounding)
Theorem 6: Backdoor Identification in Confounded MDPs (pearl2009causality; zhang2020causal)
