Faster Reinforcement Learning by Freezing Slow States
Yijia Wang, Daniel R. Jiang
TL;DR
This paper introduces fast-slow MDPs, in which slow and fast state components evolve on different timescales, and uses a frozen-state approximation to reduce computation. By solving a lower-level finite-horizon problem with the slow states frozen and an upper-level infinite-horizon problem on a slower timescale, the authors derive frozen-state value iteration (FSVI) and its fitted variant (FSFVI) for both known-model and generative-model settings. They provide regret bounds that decompose the error into reward approximation, horizon effects, and upper-level approximation. They validate the approach on inventory control, gridworld, and dynamic pricing tasks, reporting significant computational savings and improved policy quality compared to standard VI. The work offers a principled mechanism for exploiting fast-slow structure in MDPs, guides the selection of the freezing horizon $T$, and enables scalable RL in long-horizon, high-frequency decision problems.
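As a rough illustration of the two-level idea, the Python sketch below runs a simplified frozen-state scheme on a small tabular MDP. It is not the authors' FSVI. It rests on assumptions that are not stated in the summary: the slow state evolves exogenously through a hypothetical matrix `P_slow`, the lower-level problem uses a zero terminal value, and the slow state jumps by its `T`-step kernel at the end of each upper-level step. All function and array names are made up for the example.

```python
import numpy as np

def standard_vi(r, P_fast, P_slow, gamma, n_iter=2000):
    """Plain value iteration on the base MDP, used only as a reference.

    r[x, y, a]          reward for slow state x, fast state y, action a
    P_fast[x, y, a, z]  probability that the fast state moves from y to z
    P_slow[x, x2]       probability that the slow state moves from x to x2
                        (assumed independent of y and a)
    """
    nx, ny, _ = r.shape
    V = np.zeros((nx, ny))
    for _ in range(n_iter):
        EV = P_slow @ V  # expectation over the next slow state
        Q = r + gamma * np.einsum("xyaz,xz->xya", P_fast, EV)
        V = Q.max(axis=2)
    return V

def frozen_state_vi(r, P_fast, P_slow, gamma, T, n_iter=2000):
    """Simplified two-level scheme: freeze x for T periods, then let it jump."""
    nx, ny, _ = r.shape
    xs = np.arange(nx)[:, None]
    ys = np.arange(ny)[None, :]

    # Lower level: (T - 1)-step backward induction for each frozen x,
    # with zero terminal value (an assumption of this sketch).
    J = np.zeros((nx, ny))               # value with k steps to go
    M = np.tile(np.eye(ny), (nx, 1, 1))  # k-step fast-state kernel under the lower-level policy
    for _ in range(T - 1):
        Q = r + gamma * np.einsum("xyaz,xz->xya", P_fast, J)
        pi = Q.argmax(axis=2)            # greedy action with k steps to go
        P_pi = P_fast[xs, ys, pi]        # shape (nx, ny, ny)
        M = P_pi @ M                     # this step's action is applied first
        J = Q.max(axis=2)

    # Upper level: value iteration with effective discount gamma**T.
    P_slow_T = np.linalg.matrix_power(P_slow, T)
    V = np.zeros((nx, ny))
    for _ in range(n_iter):
        EV = P_slow_T @ V                # slow state jumps after T frozen periods
        cont = J + gamma ** (T - 1) * np.einsum("xzw,xw->xz", M, EV)
        Q = r + gamma * np.einsum("xyaz,xz->xya", P_fast, cont)
        V = Q.max(axis=2)
    return V
```

With `T=1` nothing is frozen and the sketch reduces to plain value iteration, which gives a simple sanity check. For larger `T`, the upper-level loop contracts at rate `gamma**T` rather than `gamma`, at the price of an approximation error from treating the slow state as constant within each block.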
Abstract
We study infinite-horizon Markov decision processes (MDPs) with "fast-slow" structure, where some state variables evolve rapidly ("fast states") while others change more gradually ("slow states"). This structure commonly arises in practice when decisions must be made at high frequencies over long horizons, and where slowly changing information still plays a critical role in determining optimal actions. Examples include inventory control under slowly changing demand indicators or dynamic pricing with gradually shifting consumer behavior. Modeling the problem at the natural decision frequency leads to MDPs with discount factors close to one, making them computationally challenging. We propose a novel approximation strategy that "freezes" slow states during phases of lower-level planning and subsequently applies value iteration to an auxiliary upper-level MDP that evolves on a slower timescale. Freezing states for short periods of time leads to easier-to-solve lower-level problems, while a slower upper-level timescale allows for a more favorable discount factor. On the theoretical side, we analyze the regret incurred by our frozen-state approach, which leads to simple insights on how to trade off regret versus computational cost. Empirically, we benchmark our new frozen-state methods on three domains: (i) inventory control with fixed order costs, (ii) a gridworld problem with spatial tasks, and (iii) dynamic pricing with reference-price effects. We demonstrate that the new methods produce high-quality policies with significantly less computation, and we show that simply omitting slow states is often a poor heuristic.
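To make the "more favorable discount factor" concrete, the short Python sketch below compares how many value-iteration sweeps are needed to shrink the initial error by a fixed factor at different timescales. It assumes that an upper-level step spanning T base periods is discounted by gamma**T, and it uses the standard fact that value iteration contracts the error by the discount factor on each sweep. The base discount 0.999 and the values of T are illustrative choices, not figures from the paper.

```python
import math

def sweeps_to_tolerance(discount, tol=1e-3):
    """Smallest n such that discount**n <= tol."""
    return math.ceil(math.log(tol) / math.log(discount))

gamma = 0.999  # illustrative base discount, close to one
for T in (1, 10, 50):
    upper_discount = gamma ** T  # assumed discount per upper-level step
    print(T, round(upper_discount, 4), sweeps_to_tolerance(upper_discount))
```

The sweep count falls roughly in proportion to 1/T. Each upper-level sweep also depends on the solution of a lower-level problem whose horizon grows with T, and a longer freeze introduces more approximation error, which is the regret-versus-computation tradeoff the abstract refers to.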
