Faster Reinforcement Learning by Freezing Slow States

Yijia Wang; Daniel R. Jiang

TL;DR

This paper introduces fast-slow MDPs, in which slow and fast state components evolve on different timescales, and uses a frozen-state approximation to reduce computation. The approach solves a lower-level finite-horizon problem with the slow states frozen and an upper-level infinite-horizon problem on a slower timescale. From this the authors derive frozen-state value iteration (FSVI) and its fitted variant (FSFVI) for both known-model and generative-model settings. They provide regret bounds that decompose the error into reward approximation, horizon effects, and upper-level approximation. Experiments on inventory control, gridworld, and dynamic pricing tasks show that the methods produce high-quality policies with significantly less computation than standard VI. The work offers a principled mechanism for exploiting fast-slow structure in MDPs, guides the selection of the freezing horizon $T$, and enables scalable RL in long-horizon, high-frequency decision problems.
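As a rough illustration of why the choice of $T$ matters, take the upper-level discount factor to be $\gamma^T$ (as stated in the caption of Figure 3) and use the common rule of thumb that a discount factor $\gamma$ corresponds to an effective horizon of about $1/(1-\gamma)$ steps. The values $\gamma = 0.99$ and $T = 10$ are illustrative assumptions, not numbers taken from the paper:

$$
\gamma^T = 0.99^{10} \approx 0.904, \qquad \frac{1}{1-\gamma} = 100 \ \text{base periods}, \qquad \frac{1}{1-\gamma^T} \approx 10.5 \ \text{upper-level steps}.
$$

Under these assumptions the upper-level problem looks roughly ten decision steps ahead instead of one hundred, at the price of solving a $(T-1)$-period lower-level problem with the slow state frozen.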

Abstract

We study infinite-horizon Markov decision processes (MDPs) with "fast-slow" structure, where some state variables evolve rapidly ("fast states") while others change more gradually ("slow states"). This structure commonly arises in practice when decisions must be made at high frequencies over long horizons, and where slowly changing information still plays a critical role in determining optimal actions. Examples include inventory control under slowly changing demand indicators or dynamic pricing with gradually shifting consumer behavior. Modeling the problem at the natural decision frequency leads to MDPs with discount factors close to one, making them computationally challenging. We propose a novel approximation strategy that "freezes" slow states during phases of lower-level planning and subsequently applies value iteration to an auxiliary upper-level MDP that evolves on a slower timescale. Freezing states for short periods of time leads to easier-to-solve lower-level problems, while a slower upper-level timescale allows for a more favorable discount factor. On the theoretical side, we analyze the regret incurred by our frozen-state approach, which leads to simple insights on how to trade off regret versus computational cost. Empirically, we benchmark our new frozen-state methods on three domains: (i) inventory control with fixed order costs, (ii) a gridworld problem with spatial tasks, and (iii) dynamic pricing with reference-price effects. We demonstrate that the new methods produce high-quality policies with significantly less computation, and we show that simply omitting slow states is often a poor heuristic.
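The two-level scheme described in the abstract can be sketched for a small tabular problem. This is a minimal illustration, not the authors' implementation: the function names (`solve_lower_level`, `t_step_kernel`, `fsvi`) are hypothetical, and the sketch assumes a zero terminal value for the lower-level problem, slow and fast components that transition independently given the current state and action, a lower-level policy indexed by the frozen slow state and the current fast state, and an upper-level reward of the form $r(s,a) + \gamma\,\mathbb{E}[J_1]$ with discount $\gamma^T$.

```python
import numpy as np

def solve_lower_level(r, P_y, gamma, T):
    """Finite-horizon DP over periods 1..T-1 with the slow state x frozen.
    r[x, y, a] is the reward; P_y[x, y, a, y2] is the fast-state kernel."""
    nx, ny, na = r.shape
    J = np.zeros((nx, ny))  # assumed zero terminal value at period T
    policy = {}
    for t in range(T - 1, 0, -1):
        # x stays fixed, so the expectation is over the next fast state only.
        Q = r + gamma * np.einsum("xyab,xb->xya", P_y, J)
        policy[t] = Q.argmax(axis=2)
        J = Q.max(axis=2)
    return J, policy  # J is J_1; policy[t][x0, y] is the period-t action

def t_step_kernel(P_x, P_y, policy, T):
    """K[x0, y0, a, x, y] = P(s_T = (x, y) | s_0 = (x0, y0), a_0 = a) under
    the TRUE dynamics, following the lower-level policy for t = 1..T-1."""
    nx, ny, na = P_x.shape[:3]
    # Assumed factorization: slow and fast components move independently.
    P = np.einsum("xyac,xyad->xyacd", P_x, P_y)
    K = np.zeros((nx, ny, na, nx, ny))
    for x0 in range(nx):
        for y0 in range(ny):
            for a in range(na):
                d = P[x0, y0, a]  # distribution of s_1
                for t in range(1, T):
                    nxt = np.zeros((nx, ny))
                    for x in range(nx):
                        for y in range(ny):
                            a_t = policy[t][x0, y]  # sees frozen x0 (assumption)
                            nxt += d[x, y] * P[x, y, a_t]
                    d = nxt
                K[x0, y0, a] = d
    return K

def fsvi(r, P_x, P_y, gamma, T, iters=500):
    """Value iteration on the upper-level problem with discount gamma**T."""
    J1, policy = solve_lower_level(r, P_y, gamma, T)
    K = t_step_kernel(P_x, P_y, policy, T)
    # Upper-level reward: immediate reward plus discounted lower-level value.
    R = r + gamma * np.einsum("xyab,xb->xya", P_y, J1)
    U = np.zeros(r.shape[:2])
    for _ in range(iters):
        U = (R + gamma**T * np.einsum("xyacd,cd->xya", K, U)).max(axis=2)
    mu = (R + gamma**T * np.einsum("xyacd,cd->xya", K, U)).argmax(axis=2)
    return U, mu, policy

# Toy instance (all numbers are made up for illustration).
rng = np.random.default_rng(0)
nx, ny, na = 3, 4, 2
r = rng.random((nx, ny, na))
P_y = rng.random((nx, ny, na, ny))
P_y /= P_y.sum(axis=-1, keepdims=True)
P_x = np.full((nx, ny, na, nx), 0.05 / (nx - 1))  # slow state rarely moves
for x in range(nx):
    P_x[x, :, :, x] = 0.95
U, mu, policy = fsvi(r, P_x, P_y, gamma=0.95, T=4)
```

With `T=1` the lower-level problem is empty and the sketch reduces to ordinary value iteration on the base model, which gives a convenient sanity check. Larger `T` lowers the upper-level discount factor `gamma**T` but relies more heavily on the frozen-state approximation, which is the trade-off the regret analysis addresses.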

Paper Structure (47 sections, 25 theorems, 114 equations, 10 figures, 2 tables, 7 algorithms)

Introduction
Main Contributions
Related Work
Fast-Slow MDPs
Base Model
Hierarchical Reformulation using Fixed-Horizon Policies
The Frozen-State Approximation
The Lower-Level MDP (Frozen Slow States)
The Upper-Level MDP (True State Dynamics)
Frozen-State Value Iteration
Natural Application Domains for FSVI
Computational Cost of FSVI
Theoretical Analysis
Reward Approximation Error
Defining Regret
...and 32 more sections

Key Result

Proposition 3.1

Given an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{W}, f, r, \gamma \rangle$, the following hold:

Figures (10)

Figure 1: Illustration of a stationary policy $\mu$ (upper timeline) and a $T$-periodic policy $(\mu, \boldsymbol{\pi})$ (lower timeline) for $T=4$. The periods covered by the $T$-period reward associated with $(\mu, \boldsymbol{\pi})$ are visualized with brackets in the lower timeline.
Figure 2: A comparison of the lower-level problem of the hierarchical reformulation vs the lower-level problem of the frozen-state approximation.
Figure 3: Illustration of the upper-level problem. Notably, the discount factor is $\gamma^T$ and the reward function, from the point of view of $\mu$, depends on the lower-level value function $J_1$. This value function is computed by freezing states, as visualized by the grey box.
Figure 4: Reward approximation error comparison for two values of $L_f$.
Figure 5: Regret versus computational cost for Base VI and FSVI as the number of value iteration steps increases. In this example, we see three regimes depending on the computational budget: in the low budget regime, the lowest regret algorithm is FSVI with $T=16$; in the medium budget regime, FSVI with $T=9$ achieves lowest regret; and in the high budget regime, not surprisingly, Base VI achieves lowest regret.
...and 5 more figures

Theorems & Definitions (56)

Remark 1
Definition 1: Fast-Slow MDP
Proposition 3.1
proof
Remark 2
Remark 3: Non-zero terminal value
Proposition 6.1: Reward Approximation Error
proof
Definition 2: Regret
Remark 4
...and 46 more
