Highway Graph to Accelerate Reinforcement Learning

Zidu Yin, Zhen Zhang, Dong Gong, Stefano V. Albrecht, Javen Q. Shi

TL;DR

The paper introduces the highway graph to accelerate reinforcement learning by compressing the empirical state-transition graph into a smaller highway graph, enabling multi-step value propagation and faster convergence of value updates. It defines the highway MDP and a graph Bellman operator to perform complete value updates on the reduced graph, with convergence guarantees. Empirical results across four task categories show speedups of roughly 10x to over 150x while preserving or improving returns, and a neural network re-parameterization (HG-Q) provides storage-efficient, generalizable policy initialization. The approach yields significant training-time gains, strong sample efficiency, and improved generalization, with a clear path to extending the method to more complex and stochastic environments. The work offers a practical, scalable mechanism to enhance VI-based RL and lays groundwork for broader integration into RL algorithms.

Abstract

Reinforcement Learning (RL) algorithms often struggle with low training efficiency. A common approach to address this challenge is integrating model-based planning algorithms, such as Monte Carlo Tree Search (MCTS) or Value Iteration (VI), into the environmental model. However, VI requires iterating over a large tensor, updating the value of each preceding state from its succeeding state through value propagation, which is computationally intensive. To enhance RL training efficiency, we propose improving the efficiency of the value learning process. In deterministic environments with discrete state and action spaces, we observe that on the sampled empirical state-transition graph, a non-branching sequence of transitions, termed a highway, can take the agent to another state without deviation through intermediate states. On these non-branching highways, the value-updating process can be streamlined into a single-step operation, eliminating the need for step-by-step updates. Building on this observation, we introduce the highway graph to model state transitions. The highway graph compresses the transition model into a compact representation, where edges can encapsulate multiple state transitions, enabling value propagation across multiple time steps in a single iteration. By integrating the highway graph into RL, the training process is significantly accelerated, particularly in the early stages of training. Experiments across four categories of environments demonstrate that our method learns significantly faster than established and state-of-the-art RL algorithms (often by a factor of 10 to 150) while maintaining equal or superior expected returns. Furthermore, a deep neural network-based agent trained using the highway graph exhibits improved generalization capabilities and reduced storage costs. Code is publicly available at https://github.com/coodest/highwayRL.
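
The compression step described in the abstract can be illustrated with a minimal sketch. It assumes a deterministic empirical transition graph stored as a dictionary from (state, action) to (next_state, reward). The function name `build_highway_graph`, the data layout, and the rule used to pick branching states (any state that does not have exactly one incoming and one outgoing edge) are illustrative assumptions, not the paper's definitions or the code in the linked repository.

```python
from collections import defaultdict


def build_highway_graph(transitions, gamma=0.99):
    """Merge non-branching paths of a deterministic graph into single edges.

    transitions: dict (state, action) -> (next_state, reward)   [assumed layout]
    returns:     dict (state, action) -> (end_state, discounted_reward, n_steps),
                 keyed only by branching ("intersection") states.
    """
    out_edges = defaultdict(list)  # state -> [(action, next_state, reward)]
    in_degree = defaultdict(int)   # state -> number of incoming edges
    states = set()
    for (s, a), (s_next, r) in transitions.items():
        out_edges[s].append((a, s_next, r))
        in_degree[s_next] += 1
        states.update((s, s_next))

    # Assumption: a state lies inside a highway only if exactly one edge
    # enters it and exactly one edge leaves it; all others are intersections.
    intersections = {
        s for s in states
        if not (in_degree[s] == 1 and len(out_edges[s]) == 1)
    }

    highway = {}
    for s in intersections:
        for a, s_next, r in out_edges[s]:
            total, discount, steps = r, gamma, 1
            # Follow the unique outgoing edge until the next intersection,
            # accumulating the discounted reward along the way.
            while s_next not in intersections:
                _, following, r_next = out_edges[s_next][0]
                total += discount * r_next
                discount *= gamma
                steps += 1
                s_next = following
            highway[(s, a)] = (s_next, total, steps)
    return highway


# Hypothetical toy graph: a chain 0 -> 1 -> 2 -> 3, a branch at 3, and two
# short paths that rejoin at the terminal state 6.
transitions = {
    (0, "a"): (1, 0.0),
    (1, "a"): (2, 0.0),
    (2, "a"): (3, 1.0),
    (3, "l"): (4, 0.0),
    (3, "r"): (5, 0.0),
    (4, "a"): (6, 2.0),
    (5, "a"): (6, 5.0),
}
highway = build_highway_graph(transitions, gamma=0.5)
```

In this toy graph, seven one-step transitions collapse into three highway edges between the states 0, 3, and 6, each carrying its accumulated discounted reward and its length in steps.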

Paper Structure

This paper contains 43 sections, 3 theorems, 35 equations, 19 figures, 3 tables, and 2 algorithms.

Key Result

Lemma 1

Denote the state value vector $v = \left[\hat{V}(s_{1}), \hat{V}(s_{2}), \cdots, \hat{V}(s_{i}), \cdots \right]^{\intercal} \in \mathcal{V}$, where $\mathcal{V} \in \mathbb{R}^{|\mathcal{S}_{\text{inter}}|}$, $\hat{V}$ is an estimate of the state value function $V$, and $s_{i} \in$ ...

Figures (19)

  • Figure 1: Comparison of the state-transition graph and its corresponding highway graph. An RL agent on the state-transition graph learns/propagates values state by state per learning step, whereas the highway graph propagates values for a stack of states in the highway per learning step. The highway graph is much smaller than the state-transition graph, which leads to more efficient value learning. Taking the Atari game Star Gunner as an example, the 234,521 states in the sampled empirical state-transition graph are represented by a highway graph with only 1,084 states (0.5% of the original size) connected by highways.
  • Figure 2: Comparison of training time to convergence (within one million frames) and the corresponding speedups of the highway graph over baselines. The first row of images shows example states from each environment. The results show that adopting the highway graph makes RL agent training 10 to more than 150 times faster than the baselines. All experiments were performed on the same machine with a 12-core CPU and 128 GB of memory.
  • Figure 3: Types of value updating.
  • Figure 4: Overall data flow of our highway graph RL method. The actor (left) sends the transitions sampled by the behavior policy to the learner (right), which (1) constructs the empirical state-transition graph with rewards (Section \ref{sec:st-graph}); (2) converts the empirical state-transition graph to the corresponding highway graph (Section \ref{sec:highway-graph}); and (3) updates the values of state-actions in the highway graph with an improved value iteration algorithm and re-parameterizes the highway graph as a neural network-based agent that serves as the new behavior policy (Section \ref{sec:agent-policy}).
  • Figure 5: Intuition behind the highway graph. A path without branching in the empirical state-transition graph can be merged into a highway. Value updating on the highway graph is much less computationally expensive because of the dramatic reduction in nodes and edges.
  • ...and 14 more figures
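
The value propagation that the captions above describe can be sketched as plain value iteration over highway edges. This is a hedged illustration, not the paper's improved value iteration algorithm or its graph Bellman operator. It assumes each highway edge is stored as (end_state, discounted_reward, n_steps) and that states without outgoing edges are terminal with value 0; the function name `highway_value_iteration` and the toy graph are hypothetical.

```python
def highway_value_iteration(highway, gamma=0.99, tol=1e-8, max_iters=1000):
    """Value iteration where each backup spans a whole highway.

    highway: dict (state, action) -> (end_state, discounted_reward, n_steps)
    returns: (state values, state-action values)
    """
    if not highway:
        return {}, {}
    states = {s for s, _ in highway} | {end for end, _, _ in highway.values()}
    value = {s: 0.0 for s in states}  # assumption: terminal states stay at 0
    q = {}
    for _ in range(max_iters):
        # One backup per highway edge: accumulated reward plus the value of
        # the end state, discounted by the number of steps the edge covers.
        q = {
            (s, a): reward + gamma ** n_steps * value[end]
            for (s, a), (end, reward, n_steps) in highway.items()
        }
        new_value = {s: 0.0 for s in states}
        best = {}
        for (s, _), q_sa in q.items():
            if s not in best or q_sa > best[s]:
                best[s] = q_sa
        new_value.update(best)
        delta = max(abs(new_value[s] - value[s]) for s in states)
        value = new_value
        if delta < tol:
            break
    return value, q


# Hypothetical highway graph with three edges between states 0, 3, and 6.
example_highway = {
    (0, "a"): (3, 0.25, 3),  # three original steps merged into one edge
    (3, "l"): (6, 1.0, 2),
    (3, "r"): (6, 2.5, 2),
}
values, q_values = highway_value_iteration(example_highway, gamma=0.5)
```

Each sweep touches only the highway edges, so the value at state 3 reaches state 0 in a single backup even though the edge covers three original transitions.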

Theorems & Definitions (13)

  • Definition 1: Empirical Transition and Reward Functions
  • Definition 2: Empirical State-Transition Graph
  • Definition 3: Highway
  • Definition 4: The highway graph
  • Definition 5: Highway MDP
  • Lemma 1
  • Definition 6: Graph Bellman operator
  • Lemma 2
  • Proposition 1: Convergence of value updating on highway graphs.
  • Remark 1
  • ...and 3 more
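
The statements listed above are not reproduced on this page. As a hedged illustration of why value updating along a highway reduces to a single step, consider a highway that leaves state $s_0$ by action $a_0$, passes through non-branching states $s_1, \dots, s_{n-1}$, and ends at $s_n$, collecting rewards $r_0, \dots, r_{n-1}$ under discount factor $\gamma$. The notation here is an assumption chosen for illustration and may differ from the paper's. Because the environment is deterministic and each intermediate state has only one observed outgoing transition, repeatedly applying the standard Bellman backup gives

$$
\hat{Q}(s_0, a_0) = r_0 + \gamma \hat{V}(s_1) = r_0 + \gamma r_1 + \gamma^2 \hat{V}(s_2) = \cdots = \sum_{k=0}^{n-1} \gamma^{k} r_k + \gamma^{n} \hat{V}(s_n).
$$

The sum $\sum_{k=0}^{n-1} \gamma^{k} r_k$ depends only on the highway itself, so it can be computed once when the highway is built and reused in every later update.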