What Makes Value Learning Efficient in Residual Reinforcement Learning?
Guozheng Ma, Lu Li, Haoyu Wang, Zixuan Liu, Pierre-Luc Bacon, Dacheng Tao
TL;DR
This paper tackles value learning efficiency in residual RL, where a frozen base policy is refined by a bounded residual, formalized as $a = a_{\text{base}} + \lambda a_{\text{res}}$. It identifies two core bottlenecks: cold-start pathology (the critic lacks knowledge of the value landscape around the base policy) and structural scale mismatch (the residual contribution is overwhelmed by the base action). The authors show that warmup data gathered from the base policy acts as a value anchor and that critic normalization (Layer Normalization) restores sensitivity to value differences, while distributional objectives offer no extra gains over standard mean-based value learning. They propose DAWN, which combines Data-Anchored Warmup and Normalization. DAWN yields substantial improvements in sample efficiency across the ManiSkill and Adroit benchmarks, is robust to different base policies and observation modalities, and relies on just two hyperparameters, $M$ and $\lambda$.
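As a rough illustration of the residual formulation above, the sketch below composes a frozen base action with a scaled residual. It is not the authors' implementation: the function name `compose_action`, the choice of clipping the residual to $[-1, 1]$ as the bounding mechanism, and the action limits `low` and `high` are all assumptions made for illustration.

```python
import numpy as np


def compose_action(a_base, a_res, lam, low=-1.0, high=1.0):
    """Combine a frozen base action with a bounded residual correction.

    Hypothetical helper: implements a = a_base + lam * a_res.
    """
    a_base = np.asarray(a_base, dtype=np.float64)
    # Assumption: the residual is bounded by clipping it to [-1, 1].
    a_res = np.clip(np.asarray(a_res, dtype=np.float64), -1.0, 1.0)
    # Assumption: the environment expects actions inside [low, high].
    return np.clip(a_base + lam * a_res, low, high)


# The residual can move each action dimension by at most lam.
action = compose_action(a_base=[0.5, -0.2], a_res=[10.0, 0.0], lam=0.1)
```

Under these assumptions, a small $\lambda$ means the residual shifts each action dimension by at most $\lambda$, which illustrates why the residual contribution is small relative to the base action.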
Abstract
Residual reinforcement learning (RL) enables stable online refinement of expressive pretrained policies by freezing the base and learning only bounded corrections. However, value learning in residual RL poses unique challenges that remain poorly understood. In this work, we identify two key bottlenecks: cold-start pathology, where the critic lacks knowledge of the value landscape around the base policy, and structural scale mismatch, where the residual contribution is dwarfed by the base action. Through systematic investigation, we uncover the mechanisms underlying these bottlenecks, revealing that simple yet principled solutions suffice: base-policy transitions serve as an essential value anchor for implicit warmup, and critic normalization effectively restores representation sensitivity for discerning value differences. Based on these insights, we propose DAWN (Data-Anchored Warmup and Normalization), a minimal approach targeting efficient value learning in residual RL. By addressing these bottlenecks, DAWN demonstrates substantial efficiency gains across diverse benchmarks, policy architectures, and observation modalities.
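The two ingredients named in the abstract can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the DAWN implementation: `collect_warmup`, `TinyCritic`, `env_step`, and `num_warmup` are hypothetical names, the critic architecture is arbitrary, and treating the warmup size as a fixed count of base-policy transitions is a guess at how such a warmup might be configured.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


class TinyCritic:
    """Hypothetical one-hidden-layer Q-network with layer normalization."""

    def __init__(self, obs_dim, act_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = obs_dim + act_dim
        self.w1 = rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1))
        self.b2 = np.zeros(1)

    def q(self, obs, act):
        x = np.concatenate([obs, act], axis=-1)
        # Normalizing the hidden features before the ReLU is the assumed
        # placement of the critic normalization.
        h = np.maximum(layer_norm(x @ self.w1 + self.b1), 0.0)
        return (h @ self.w2 + self.b2).squeeze(-1)


def collect_warmup(env_step, base_policy, obs0, num_warmup):
    """Fill a replay buffer with base-policy transitions only (zero residual).

    env_step(obs, action) -> (next_obs, reward, done) is an assumed interface.
    """
    buffer, obs = [], obs0
    for _ in range(num_warmup):
        a_base = base_policy(obs)
        next_obs, reward, done = env_step(obs, a_base)
        buffer.append((obs, a_base, reward, next_obs, done))
        # Assumption: restart from the same initial observation on termination.
        obs = obs0 if done else next_obs
    return buffer
```

In this sketch the buffer would be filled before any residual update, so the first critic targets come from states and actions the base policy actually visits.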
