Reinforcement Learning with Quasi-Hyperbolic Discounting
S. R. Eshwar, Mayank Motwani, Nibedita Roy, Gugan Thoppe
TL;DR
The paper addresses the time-inconsistent behavior induced by quasi-hyperbolic (QH) discounting in reinforcement learning and proposes a model-free method for finding a Markov Perfect Equilibrium (MPE), which serves as a stable policy. It introduces a two-timescale critic-actor algorithm that estimates QH Q-values and uses a differential inclusion framework to handle non-uniqueness in greedy policies. Theoretical results establish boundedness of the critic and show that, if the algorithm converges, its limit must be an MPE. Numerical validation in an inventory-management scenario reveals multiple MPEs with varying profitability. The work provides a principled, model-free approach to equilibrium policy discovery under QH discounting and demonstrates its potential for human-behavior-inspired decision-making models.
Abstract
Reinforcement learning has traditionally been studied with exponential discounting or the average-reward setup, mainly due to their mathematical tractability. However, such frameworks fall short of accurately capturing human behavior, which has a bias towards immediate gratification. Quasi-Hyperbolic (QH) discounting is a simple alternative for modeling this bias. Unlike in traditional discounting, though, the optimal QH-policy starting from some time $t_1$ can differ from the one starting from $t_2$. Hence, the future self of an agent, if it is naive or impatient, can deviate from the policy that is optimal at the start, leading to sub-optimal overall returns. To prevent this behavior, an alternative is to work with a policy anchored in a Markov Perfect Equilibrium (MPE). In this work, we propose the first model-free algorithm for finding an MPE. Using a two-timescale analysis, we show that, if our algorithm converges, then the limit must be an MPE. We also validate this claim numerically for the standard inventory system with stochastic demands. Our work significantly advances the practical application of reinforcement learning.
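The time inconsistency described in the abstract can be illustrated with a minimal sketch. It assumes the standard quasi-hyperbolic weighting, in which a reward received after a delay of $k$ steps is weighted by $1$ if $k = 0$ and by $\sigma \gamma^k$ otherwise. The symbols `sigma` and `gamma`, the two-option choice problem, and all numeric values below are illustrative assumptions, not taken from the paper.

```python
def qh_weight(delay, sigma, gamma):
    """Quasi-hyperbolic weight: 1 for an immediate reward, sigma * gamma**delay otherwise."""
    return 1.0 if delay == 0 else sigma * gamma ** delay


def preferred_option(now, options, sigma, gamma):
    """Return the label of the option with the highest discounted value as seen from time `now`."""
    def value(option):
        _, time, reward = option
        return qh_weight(time - now, sigma, gamma) * reward
    return max(options, key=value)[0]


# Hypothetical choice: a smaller reward at t=5 versus a larger reward at t=6.
options = [("small_soon", 5, 10.0), ("large_late", 6, 12.0)]
SIGMA, GAMMA = 0.5, 0.95  # assumed values; sigma = 1 recovers exponential discounting

plan_at_0 = preferred_option(0, options, SIGMA, GAMMA)    # what the agent plans at t=0
choice_at_5 = preferred_option(5, options, SIGMA, GAMMA)  # what its future self picks at t=5
exp_at_0 = preferred_option(0, options, 1.0, GAMMA)       # exponential discounting, t=0
exp_at_5 = preferred_option(5, options, 1.0, GAMMA)       # exponential discounting, t=5

print(plan_at_0, choice_at_5, exp_at_0, exp_at_5)
```

With these values, the agent at time 0 plans to wait for the larger reward, but on reaching time 5 its future self takes the smaller one. With `sigma = 1` the ranking is the same at both times, which is why this issue does not arise under exponential discounting.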
