Highway Reinforcement Learning
Yuhui Wang, Miroslav Strupl, Francesco Faccio, Qingyuan Wu, Haozhe Liu, Michał Grudzień, Xiaoyang Tan, Jürgen Schmidhuber
TL;DR
This work addresses the challenge of credit assignment with highly delayed rewards in off-policy RL. It identifies a fundamental underestimation problem in vanilla multi-step operators that avoid importance sampling (IS-free operators) and proposes Highway RL, which uses a highway gate to adaptively combine 1-step and distant lookahead information, ensuring convergence to the optimal value function (VF) for arbitrary lookahead depths and behavioral policies. The paper introduces Highway Value Iteration for the model-based setting and Highway DQN (including a Softmax variant) for the model-free setting, and provides theoretical convergence guarantees along with extensive experiments on toy tasks, MinAtar, and delayed-reward benchmarks. The results demonstrate faster convergence and superior performance in delayed-reward environments, with ablations highlighting the gate, the max/softmax aggregations, and the lookahead-depth strategies as key drivers. Overall, Highway RL offers a principled, scalable framework for rapid, safe credit assignment across long temporal horizons, with practical impact for environments where rewards are significantly delayed.
Abstract
Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL). Approaches based on importance sampling (IS) often suffer from large variances due to products of IS ratios. Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and utilize off-policy data directly without any additional adjustment. They work well for proper choices of $n$. We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large $n$, restricting their capacity to efficiently utilize information from distant future time steps. To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF. At its core lies a simple but non-trivial highway gate, which controls the information flow from the distant future by comparing it to a threshold. The highway gate guarantees convergence to the optimal VF for arbitrary $n$ and arbitrary behavioral policies. It gives rise to a novel family of off-policy RL algorithms that safely learn even when $n$ is very large, facilitating rapid credit assignment from the far future to the past. On tasks with greatly delayed rewards, including video games where the reward is given only at the end of the game, our new methods outperform many existing multi-step off-policy algorithms.
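The underestimation issue and the role of a gate can be illustrated with a small numerical sketch. It is not the paper's exact operator: it assumes the gate's threshold is the one-step bootstrap value at the next state and that the gate keeps the larger of that value and the distant lookahead return. The function names (`n_step_return`, `vanilla_n_step_target`, `gated_target`) and all numbers are hypothetical.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], bootstrap: float, gamma: float) -> float:
    """Discounted sum of the given rewards plus a discounted bootstrap value."""
    discounted = sum(gamma**t * r for t, r in enumerate(rewards))
    return discounted + gamma ** len(rewards) * bootstrap


def vanilla_n_step_target(rewards: Sequence[float], v_end: float, gamma: float) -> float:
    """IS-free n-step target: follows the behavior trajectory with no correction."""
    return n_step_return(rewards, v_end, gamma)


def gated_target(rewards: Sequence[float], v_next: float, v_end: float, gamma: float) -> float:
    """Assumed gate: pass the distant lookahead only if it beats the
    one-step bootstrap value v_next (the threshold in this sketch)."""
    lookahead = n_step_return(rewards[1:], v_end, gamma)
    return rewards[0] + gamma * max(v_next, lookahead)


gamma = 0.9

# Trajectory from a poor behavior policy: the vanilla 3-step target is
# dragged below the 1-step target, while the gate blocks the bad lookahead.
bad_rewards = [0.0, -1.0, -1.0]
vanilla_bad = vanilla_n_step_target(bad_rewards, v_end=1.0, gamma=gamma)
gated_bad = gated_target(bad_rewards, v_next=1.0, v_end=1.0, gamma=gamma)

# Trajectory with a large delayed reward: the gate lets the distant
# information through because it exceeds the current one-step estimate.
good_rewards = [0.0, 0.0, 10.0]
gated_good = gated_target(good_rewards, v_next=0.0, v_end=0.0, gamma=gamma)

print(vanilla_bad, gated_bad, gated_good)
```

With these made-up numbers, the vanilla 3-step target on the poor trajectory is about -0.98, whereas the gated target stays at the one-step value of about 0.9. On the delayed-reward trajectory the gated target rises to about 8.1, showing how distant reward can be propagated in a single update under the stated assumption.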
