A Prospect-Theoretic Policy Gradient Framework for Behaviorally Nuanced Reinforcement Learning

Olivier Lepel, Anas Barakat

TL;DR

This work introduces a CPT-based reinforcement learning framework that models human risk attitudes via reference points, nonlinear utilities, and probability distortion. It derives a CPT policy gradient theorem and develops a model-free CPT-PG algorithm, with theoretical guarantees of consistency and asymptotic convergence, achieving favorable sample complexity compared to zeroth-order methods. Through simulations in bandits, grid-worlds, finance, and control tasks, CPT-PG demonstrates nuanced behaviors (e.g., reflection effect) and scalability to larger state spaces, enabling more behaviorally faithful RL. The framework advances behaviorally aware sequential decision-making with potential impact across healthcare, finance, energy, and autonomous systems.

Abstract

Classical reinforcement learning (RL) typically assumes rational decision-making based on expected utility theory. However, this model has been shown to be empirically inconsistent with actual human preferences, as evidenced in psychology and behavioral economics. Cumulative Prospect Theory (CPT) provides a more nuanced model of human decision-making, capturing diverse attitudes and perceptions toward risk, gains, and losses. While prior work has integrated CPT with RL to solve CPT policy optimization problems, the understanding and impact of this formulation remain limited. Our contributions are as follows: (a) we derive a novel policy gradient theorem for CPT objectives, generalizing the foundational result in standard RL, (b) we design a model-free policy gradient algorithm for solving the CPT-RL problem, (c) we analyze our policy gradient estimator and prove asymptotic convergence of the algorithm to first-order stationary points, and (d) we test its performance through simulations. Notably, our first-order policy gradient algorithm scales better than existing zeroth-order methods to larger state spaces. Our theoretical framework offers more flexibility to advance the integration of behavioral decision-making into RL.
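To make the three CPT ingredients named above concrete (reference point, nonlinear utility, probability distortion), here is a minimal sketch of a sample-based CPT value estimate. It is an illustration under stated assumptions, not the paper's estimator: the utility and weighting functions are the Tversky and Kahneman (1992) parametric forms with their published parameter estimates, the order-statistics estimator is the standard quantile-based one from earlier CPT-RL work, and the names `tk_weight` and `cpt_value` are hypothetical.

```python
import numpy as np


def tk_weight(p, gamma):
    """Tversky-Kahneman probability weighting function (assumed form)."""
    p = np.asarray(p, dtype=float)
    return p**gamma / (p**gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)


def cpt_value(returns, x0=0.0, alpha=0.88, beta=0.88, lam=2.25,
              gamma=0.61, delta=0.69):
    """Estimate the CPT value of a return distribution from samples.

    x0 is the reference point; alpha, beta, lam parametrize the utility
    on gains and losses; gamma, delta parametrize the weighting
    functions for gains and losses. All defaults are assumptions.
    """
    # Sort samples in ascending order, measured relative to the reference.
    x = np.sort(np.asarray(returns, dtype=float)) - x0
    n = x.size
    i = np.arange(1, n + 1)

    # Utility of gains (zero on losses) and of loss magnitudes (zero on gains).
    gains = np.where(x > 0, np.abs(x) ** alpha, 0.0)
    losses = np.where(x < 0, lam * np.abs(x) ** beta, 0.0)

    # Decision weights: differences of the distorted empirical tail
    # probabilities P(X >= x_i) for gains and P(X <= x_i) for losses.
    w_gain = tk_weight((n + 1 - i) / n, gamma) - tk_weight((n - i) / n, gamma)
    w_loss = tk_weight(i / n, delta) - tk_weight((i - 1) / n, delta)

    return float(np.sum(gains * w_gain) - np.sum(losses * w_loss))


if __name__ == "__main__":
    # Gain lottery: a sure 50 versus a 50/50 chance of 0 or 100.
    print("safe gain :", cpt_value([50, 50]))
    print("risky gain:", cpt_value([0, 100]))
    # Loss lottery: a sure -50 versus a 50/50 chance of -100 or 0.
    print("safe loss :", cpt_value([-50, -50]))
    print("risky loss:", cpt_value([-100, 0]))
```

Under these assumed parameters the sure outcome scores higher in the gain lottery and the risky one scores higher in the loss lottery, which is the reflection effect mentioned in the TL;DR. With identity utility and identity weighting the estimate would reduce to the sample mean.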

Paper Structure (55 sections, 12 theorems, 37 equations, 26 figures, 4 tables, 1 algorithm)


Key Result

Proposition 2

There exist instances of the CPT policy optimization problem (CPT-PO) where $\mathcal{U}$ is of the form $x\mapsto A+B\exp(Cx)$ for positive constants $A, B, C$ and (CPT-PO) does not admit an optimal policy in $\Pi_{M,NS}$.

Figures (26)

  • Figure 1: Comparison of our CPT-PG algorithm with vanilla PG (vPG) and exponential risk-sensitive PG (ERS-PG) in a simple 2-action bandit setting. (Upper fig.) Gain lottery: vPG trains a policy that picks the risky action, whereas CPT-PG and ERS-PG choose the safe one. (Lower fig.) Loss lottery: only CPT-PG picks the risky action. (Left) Recorded distorted returns, (center) evolution of the probability of the risky action along training steps, (right) actions prescribed by the trained policies. The shaded area is a range of $\pm$ one standard deviation over 5 independent runs with different seeds.
  • Figure 2: Performance comparison of our algorithm and CPT-SPSA-G for $n=3,5,9$. The shaded area is a range of $\pm$ one standard deviation over 10 runs.
  • Figure 3: Performance of our PG algorithm on a financial trading application. KT refers to Kahneman and Tversky's utility function, $x_0$ is the reference point used in that utility, exp. refers to the exponential utility, and $\alpha$ is the parameter used in the definition of KT's utility. Shaded areas are interquartile (25-75%) ranges and curves report the median values over 10 different runs.
  • Figure 4: Policy classes (see the remark on policy class inclusions).
  • Figure 5: Heat maps of the policies trained with our CPT-PG algorithm and with an exponential risk-sensitive PG algorithm. Each cell of the $5 \times 5$ grid corresponds to a state: 'R' marks risky states, 'A' the initial state, and 'G' the goal state; all other states are considered safe (with a zero probability assigned). The color encodes the probability of selecting a risky action in each state. State (2,4) is risky in the sense that it yields a high reward of $10^6$ with low probability $0.01$ and a penalty of $-10^3$ otherwise. The main observation is that CPT takes more risk to reach this state (low probability, high reward), as shown by the yellow cell above it. Overall, the two methods exhibit different risk profiles. This is partly explained by the fact that CPT overweights low-probability events through its probability weighting function.
  • ...and 21 more figures
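
The Figure 5 caption attributes part of the behavior to CPT inflating low-probability events. The sketch below illustrates that effect for the rare outcome's probability of $0.01$. It assumes the Tversky and Kahneman (1992) weighting function with $\gamma = 0.61$; the paper's own weighting function and parameters may differ, and the name `tk_weight` is hypothetical.

```python
def tk_weight(p, gamma=0.61):
    """Tversky-Kahneman probability weighting function (assumed form)."""
    return p**gamma / (p**gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)


if __name__ == "__main__":
    p_rare = 0.01  # probability of the high reward in the risky state
    w_rare = tk_weight(p_rare)
    # The distorted weight exceeds the true probability: the rare,
    # high-reward outcome counts for more than its frequency suggests.
    print(f"p = {p_rare:.2f} -> w(p) = {w_rare:.4f}")
    print(f"inflation factor: {w_rare / p_rare:.1f}x")
    # Conversely, near-certain events are underweighted.
    print(f"p = 0.99 -> w(p) = {tk_weight(0.99):.4f}")
```

With this assumed weighting, the rare outcome is weighted several times more heavily than its true probability, which is consistent with a CPT agent steering toward the low-probability, high-reward state.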

Theorems & Definitions (22)

  • Remark 1
  • Proposition 2
  • Theorem 3
  • Proposition 4
  • Proposition 5: Consistency
  • Proposition 6: Sample complexity
  • Proposition 7: Asymptotic convergence
  • Remark 8
  • Proposition 9
  • proof
  • ...and 12 more