Approximate optimality and the risk/reward tradeoff in a class of bandit problems
Zengjing Chen, Larry G. Epstein, Guodong Zhang
TL;DR
The paper studies a sequential, risk-aware decision problem with $K$ arms whose payoff distributions are known, and analyzes approximate optimality as the horizon grows large. It introduces a two-attribute utility $u$ defined over the mean payoff and a scaled deviation term, and proves a nonlinear CLT that yields a limit value $V$ depending only on the mean-variance set of the arms and its extreme points. Depending on the shape of $u$ and on parameter values, the results delineate when it is asymptotically optimal to specialize in a single arm and when it is optimal to diversify over time. Mean-variance models imply a constant risk attitude, so time diversification is unnecessary, whereas mean-semivariance and shortfall-style utilities produce richer diversification patterns. The asymptotic value is characterized in terms of the extreme arms and a nonlinear PDE obtained from a dynamic programming (HJB) argument, and explicit strategies are given in several cases, illustrating how risk attitudes endogenously shape the risk/reward tradeoff in long-horizon bandit-like problems. Overall, the work offers a tractable, analytically grounded bridge between risk-sensitive decision theory and bandit problems with known distributions, with potential implications for dynamic risk management and strategic allocation decisions.
Abstract
This paper studies a sequential decision problem where payoff distributions are known and where the riskiness of payoffs matters. Equivalently, it studies sequential choice from a repeated set of independent lotteries. The decision-maker is assumed to pursue strategies that are approximately optimal for large horizons. By exploiting the tractability afforded by asymptotics, conditions are derived characterizing when specialization in one action or lottery throughout is asymptotically optimal and when optimality requires intertemporal diversification. The key is the constancy or variability of risk attitude. The main technical tool is a new central limit theorem.
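The specialization-versus-diversification question can be made concrete with a small numerical sketch. Everything in it is an illustrative assumption rather than the authors' construction: the two Gaussian arms in `ARMS`, the utility form `u(x, y) = x - alpha * y**2` standing in for a mean-variance model, the choice of the second attribute as the sum of deviations from the arm means scaled by `1 / sqrt(n)`, and the helper names `exact_value` and `simulate_value`. It also covers only fixed, non-adaptive schedules, whereas the paper considers general strategies.

```python
import random

# Hypothetical arms, each given as (mean, standard deviation) of a Gaussian payoff.
ARMS = [(1.0, 1.0), (2.0, 2.0)]


def mean_variance_utility(x, y, alpha=0.5):
    """Assumed utility: x is the average payoff, y is the scaled deviation."""
    return x - alpha * y * y


def exact_value(schedule, arms, alpha=0.5):
    """Expected utility of a fixed schedule of arm indices.

    With independent draws, E[y^2] equals the average variance along the
    schedule, so the value is linear in the arm frequencies.
    """
    n = len(schedule)
    avg_mean = sum(arms[k][0] for k in schedule) / n
    avg_var = sum(arms[k][1] ** 2 for k in schedule) / n
    return avg_mean - alpha * avg_var


def simulate_value(schedule, arms, alpha=0.5, paths=4000, seed=0):
    """Monte Carlo estimate of the same expected utility."""
    rng = random.Random(seed)
    n = len(schedule)
    total = 0.0
    for _ in range(paths):
        payoff_sum = 0.0
        deviation_sum = 0.0
        for k in schedule:
            mu, sigma = arms[k]
            x = rng.gauss(mu, sigma)
            payoff_sum += x
            deviation_sum += x - mu
        total += mean_variance_utility(payoff_sum / n, deviation_sum / n ** 0.5, alpha)
    return total / paths


if __name__ == "__main__":
    n = 100
    schedules = {
        "always arm 0": [0] * n,
        "always arm 1": [1] * n,
        "alternating": [i % 2 for i in range(n)],
    }
    for name, schedule in schedules.items():
        print(name, exact_value(schedule, ARMS), simulate_value(schedule, ARMS))
```

Under these assumptions the value of a fixed schedule is a frequency-weighted average of the single-arm values, so the alternating schedule can never beat the better of the two specialized schedules. This is consistent with the statement that time diversification is unnecessary in the mean-variance case. A utility that is not quadratic in the deviation term would break this linearity, which is where richer patterns can arise.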
