Approximate optimality and the risk/reward tradeoff in a class of bandit problems

Zengjing Chen; Larry G. Epstein; Guodong Zhang


TL;DR

The paper studies a sequential, risk-aware decision problem in which the payoff distributions of $K$ arms are known, and analyzes approximate optimality as the horizon grows large. It introduces a two-attribute utility $u$ over the mean and a scaled deviation term, and proves a nonlinear CLT that yields a limit value $V$ depending only on the mean-variance set of the arms and its extreme points. Depending on the shape of $u$ and on parameter values, the results delineate when it is optimal to specialize in a single arm and when it is optimal to diversify over time. Mean-variance models imply a constant risk attitude, so diversification over time is unnecessary, whereas mean-semivariance and shortfall-style utilities produce richer diversification patterns. The asymptotic value is expressed via extreme arms and a nonlinear PDE (through a dynamic programming/HJB framework), and explicit strategies are given in several cases, illustrating how risk attitudes endogenously shape the risk/reward tradeoff in long-horizon bandit-like problems. Overall, the work offers a tractable, analytically grounded bridge between risk-sensitive decision theory and bandit problems under known distributions, with potential implications for dynamic risk management and strategic allocation decisions.

Abstract

This paper studies a sequential decision problem where payoff distributions are known and where the riskiness of payoffs matters. Equivalently, it studies sequential choice from a repeated set of independent lotteries. The decision-maker is assumed to pursue strategies that are approximately optimal for large horizons. By exploiting the tractability afforded by asymptotics, conditions are derived characterizing when specialization in one action or lottery throughout is asymptotically optimal and when optimality requires intertemporal diversification. The key is the constancy or variability of risk attitude. The main technical tool is a new central limit theorem.
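
To make the setup concrete, the following is a minimal Monte Carlo sketch of how a strategy could be scored under a two-attribute utility of the kind described above. It is not the authors' method. Everything specific here is an assumption made for illustration: Gaussian payoffs, the second attribute taken as $n^{-1/2}$ times the sum of payoffs minus the mean of the chosen arm, the functional forms of `mean_variance` and `mean_semivariance`, the example arms in `ARMS`, and the helper names `evaluate_strategy`, `specialize`, and `switch_on_sign`.

```python
import math
import random


def evaluate_strategy(arms, strategy, u, horizon, n_paths=2000, seed=0):
    """Monte Carlo estimate of E[u(mean payoff, scaled sum of deviations)].

    arms: list of (mean, std) pairs, assumed known to the decision-maker.
    strategy: hypothetical callable (t, payoff_sum, deviation_sum) -> arm index.
    u: two-attribute utility u(x, y).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_paths):
        payoff_sum = 0.0      # running sum of realized payoffs
        deviation_sum = 0.0   # running sum of (payoff - mean of chosen arm)
        for t in range(horizon):
            k = strategy(t, payoff_sum, deviation_sum)
            mu, sigma = arms[k]
            x = rng.gauss(mu, sigma)  # Gaussian payoffs are an assumption
            payoff_sum += x
            deviation_sum += x - mu
        total += u(payoff_sum / horizon, deviation_sum / math.sqrt(horizon))
    return total / n_paths


def specialize(k):
    """Strategy that plays arm k in every period."""
    return lambda t, payoff_sum, deviation_sum: k


def switch_on_sign(k_if_nonneg, k_if_neg):
    """Illustrative adaptive strategy keyed to the sign of the deviation sum."""
    return lambda t, payoff_sum, deviation_sum: (
        k_if_nonneg if deviation_sum >= 0 else k_if_neg
    )


def mean_variance(x, y, a=0.5):
    """Illustrative mean-variance-style utility (assumed form)."""
    return x - a * y * y


def mean_semivariance(x, y, a=0.5):
    """Illustrative utility penalizing only downside deviations (assumed form)."""
    return x - a * min(y, 0.0) ** 2


# Hypothetical arms: (mean, standard deviation).
ARMS = [(1.0, 2.0), (0.5, 0.5)]
```

For example, `evaluate_strategy(ARMS, specialize(0), mean_variance, horizon=50)` estimates the value of always playing the first arm. Under these assumptions, specializing in an arm with mean `mu` and standard deviation `sigma` has exact expected `mean_variance` utility `mu - a * sigma**2` at every horizon, which gives a simple check on the simulation. Swapping in `mean_semivariance` and `switch_on_sign` allows a rough numerical comparison of specialization with a history-dependent rule.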


Paper Structure (12 sections, 9 theorems, 119 equations)


Introduction
Related literature
The Model
Preliminaries
Utility
Optimization and the value of a set of arms
Strategies and the risk/reward tradeoff
Concluding Comments
Appendix: Proofs
Proof of Theorem \ref{thm-bandits}
Proof of Theorem \ref{thm-strategies}
Proof of Theorem \ref{thm-tradeoff}

Key Result

Theorem 1

Let $u\in C(\mathbb{R}^{2})$ and let payoffs to the $K$ arms conform to (musigma), with $\underline{\sigma }\geq 0$. Suppose further that there exists $g\geq 1$ such that $u$ satisfies the growth condition $|u(x,y)|\leq c(1+\|(x,y)\|^{g-1})$, and that payoffs satisfy $\sup_{1\leq k\leq K}E_{P}[|X_{k

Theorems & Definitions (9)

Theorem 1
Theorem 2
Theorem 3
Lemma 4
Lemma 5
Proposition 6: CLT
Corollary 7
Lemma 8
Lemma 9
