Reinforcement Learning with Quasi-Hyperbolic Discounting

S. R. Eshwar; Mayank Motwani; Nibedita Roy; Gugan Thoppe

TL;DR

The paper addresses time-inconsistent behavior induced by quasi-hyperbolic discounting in reinforcement learning and proposes a model-free method to identify Markov Perfect Equilibria (MPE) as stable policies. It introduces a two-timescale critic-actor algorithm that estimates QH Q-values and converges to an MPE when the joint process stabilizes, leveraging a differential inclusion framework to handle non-uniqueness in greedy policies. Theoretical results guarantee boundedness of the critic and, under suitable conditions, convergence to an MPE, with practical validation in an inventory-management scenario that reveals multiple MPEs with varying profitability. This work advances the practical application of QH-discounted RL by providing a principled, model-free approach to equilibrium policy discovery and demonstrates its potential for human-behavior-inspired decision-making models.
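To make the notion of an MPE concrete, the following is a minimal sketch, not the paper's model-free algorithm. It assumes the model is known and uses the common definition of the QH Q-value of a stationary policy, $Q^{\sigma,\gamma}_{\bar{\pi}}(s, a) = r(s, a) + \sigma \gamma \sum_{s'} P(s'|s, a) V^{\gamma}_{\pi}(s'),$ where $V^{\gamma}_{\pi}$ is the usual exponentially discounted value. Under that assumption, a deterministic stationary policy is treated as an MPE when it is greedy with respect to its own QH Q-values. The two-state MDP, its rewards, and all function names below are hypothetical and chosen only for illustration.

```python
# Hypothetical MDP: P[s][a] is a list of (next_state, probability) pairs,
# R[s][a] is the instantaneous reward. State 0 has two actions, state 1 has one.
P = [[[(0, 1.0)], [(1, 1.0)]], [[(0, 1.0)]]]
R = [[1.0, 0.0], [3.0]]


def evaluate_policy(P, R, policy, gamma, sweeps=2000):
    """Exponentially discounted value of a deterministic stationary policy,
    obtained by repeated application of the policy's Bellman operator."""
    V = [0.0] * len(P)
    for _ in range(sweeps):
        V = [R[s][policy[s]]
             + gamma * sum(p * V[s2] for s2, p in P[s][policy[s]])
             for s in range(len(P))]
    return V


def qh_q_values(P, R, policy, sigma, gamma):
    """QH Q-values: immediate reward plus sigma * gamma times the expected
    exponentially discounted value of following `policy` afterwards."""
    V = evaluate_policy(P, R, policy, gamma)
    return [[R[s][a] + sigma * gamma * sum(p * V[s2] for s2, p in P[s][a])
             for a in range(len(P[s]))]
            for s in range(len(P))]


def is_mpe(P, R, policy, sigma, gamma, tol=1e-9):
    """True if `policy` is greedy with respect to its own QH Q-values."""
    Q = qh_q_values(P, R, policy, sigma, gamma)
    return all(Q[s][policy[s]] >= max(Q[s]) - tol for s in range(len(P)))


f = [0, 0]  # stay in state 0
g = [1, 0]  # move from state 0 to state 1
for sigma in (0.5, 0.9):
    print(sigma, is_mpe(P, R, f, sigma, 0.8), is_mpe(P, R, g, sigma, 0.8))
```

In this toy model, which deterministic policy passes the check depends on $\sigma$, and for intermediate values of $\sigma$ neither does, which hints at why randomized policies and non-unique greedy choices need care.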

Abstract

Reinforcement learning has traditionally been studied with exponential discounting or the average reward setup, mainly due to their mathematical tractability. However, such frameworks fall short of accurately capturing human behavior, which has a bias towards immediate gratification. Quasi-Hyperbolic (QH) discounting is a simple alternative for modeling this bias. Unlike in traditional discounting, though, the optimal QH-policy starting from some time $t_1$ can differ from the one starting from $t_2.$ Hence, the future self of an agent, if it is naive or impatient, can deviate from the policy that is optimal at the start, leading to sub-optimal overall returns. To prevent this behavior, an alternative is to work with a policy anchored in a Markov Perfect Equilibrium (MPE). In this work, we propose the first model-free algorithm for finding an MPE. Using a two-timescale analysis, we show that, if our algorithm converges, then the limit must be an MPE. We also validate this claim numerically for the standard inventory system with stochastic demands. Our work significantly advances the practical application of reinforcement learning.
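The time inconsistency described above can be illustrated with a small preference-reversal sketch. It assumes the usual QH weighting, where a reward received now has weight 1 and a reward $k \geq 1$ steps ahead has weight $\sigma \gamma^k.$ The two options, their payoffs, and the parameter values are hypothetical; the example is a single choice between two delayed rewards rather than a full policy.

```python
def qh_value(delay, reward, sigma, gamma):
    """QH-discounted value of a single reward received `delay` steps from now."""
    weight = 1.0 if delay == 0 else sigma * gamma ** delay
    return weight * reward


def preferred_option(now, options, sigma, gamma):
    """Name of the option with the larger QH value as seen from time `now`.
    `options` maps a name to a (payout_time, reward) pair."""
    values = {name: qh_value(t - now, r, sigma, gamma)
              for name, (t, r) in options.items()}
    return max(values, key=values.get)


# Hypothetical choice: 10 units at time 5, or 12 units at time 6.
OPTIONS = {"early": (5, 10.0), "late": (6, 12.0)}
SIGMA, GAMMA = 0.5, 0.9

choice_at_t0 = preferred_option(0, OPTIONS, SIGMA, GAMMA)  # planning ahead
choice_at_t5 = preferred_option(5, OPTIONS, SIGMA, GAMMA)  # early payout is now immediate

# With sigma = 1 the weighting reduces to exponential discounting.
exp_choice_at_t0 = preferred_option(0, OPTIONS, 1.0, GAMMA)
exp_choice_at_t5 = preferred_option(5, OPTIONS, 1.0, GAMMA)

print(choice_at_t0, choice_at_t5, exp_choice_at_t0, exp_choice_at_t5)
```

Seen from time 0, both payouts are scaled by $\sigma,$ so the larger, later reward wins. At time 5 the early payout is no longer scaled by $\sigma,$ and the agent switches to it. With $\sigma = 1$ the ranking is the same at both times.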

Paper Structure (10 sections, 2 theorems, 10 equations, 2 figures, 7 tables, 1 algorithm)

Introduction
Setup, Goal, Algorithm, and Main Results
Setup and Goal
MPE-learning Algorithm
Main Results
Our Algorithm Design
Proof Outlines
Experiments
Conclusion and Future Directions
Acknowledgements

Key Result

Theorem 1

Suppose the stepsize and reward assumptions hold. Then the following statements hold for the iterates $(W_n)$ and $(\theta_n)$ obtained from Algorithm 1:

Figures (2)

Figure 1: (a) Comparison of discount factors under exponential, hyperbolic, and quasi-hyperbolic discounting models. (b) A two-state MDP example where the action set of state $1$ is $\{a_1, a_2\},$ while that of state $2$ is $\{a_1\}.$ For each tuple on the arrow, the first element is the probability of the transition, the second is the action taken, and the third is the instantaneous reward. (c) Q-values under QH-discounting for the policies $\bar{f} \equiv (f, f, \ldots), \bar{g} \equiv (g, g, \ldots),$ and $\bar{h} \equiv (h, h, \ldots),$ where $f(1) = f(2) = g(2) = h(2) = a_1,$ while $g(1) = a_2$ and $h(a_1|1) = h(a_2|1) = 0.5.$ Each row in the table refers to an $(s, a)$ pair, while the columns represent the corresponding policies.
Figure 2: Vector fields for the slow-timescale DI for the two-state MDP given in Fig. 1 with $\gamma = 0.8$ and $\sigma$ values of $0.3, 0.5,$ and $0.7.$ Here, the orange dot is the vector $Q^{\sigma,\gamma}_{\bar{g}},$ blue is $Q^{\sigma,\gamma}_{\bar{f}},$ while green is $Q^{\sigma,\gamma}_{\bar{h}},$ where $\bar{f}, \bar{g},$ and $\bar{h}$ are as in the caption of Fig. 1.
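The three discounting models compared in Figure 1 can be tabulated with a short sketch. The functional forms are the standard textbook ones and are an assumption here, since the summary does not state them: $\gamma^k$ for exponential, $1/(1 + \kappa k)$ for hyperbolic, and $1$ for $k = 0$ with $\sigma \gamma^k$ for $k \geq 1$ under QH. The parameter name `kappa` and all numerical values are hypothetical.

```python
def exponential_weight(k, gamma):
    """Weight on a reward k steps ahead under exponential discounting."""
    return gamma ** k


def hyperbolic_weight(k, kappa):
    """Weight on a reward k steps ahead under hyperbolic discounting."""
    return 1.0 / (1.0 + kappa * k)


def quasi_hyperbolic_weight(k, sigma, gamma):
    """Weight under QH discounting: 1 for an immediate reward,
    sigma * gamma**k for a reward k >= 1 steps ahead."""
    return 1.0 if k == 0 else sigma * gamma ** k


# Print the first few weights side by side for illustrative parameters.
for k in range(6):
    print(k,
          round(exponential_weight(k, 0.8), 4),
          round(hyperbolic_weight(k, 1.0), 4),
          round(quasi_hyperbolic_weight(k, 0.5, 0.8), 4))
```

The QH weights drop by a factor of $\sigma \gamma$ between the current step and the next one, and by a factor of $\gamma$ between any two later consecutive steps, which is the bias towards immediate rewards that the model is meant to capture.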

Theorems & Definitions (3)

Theorem 1
Remark 2
Proposition 1
