What Makes Value Learning Efficient in Residual Reinforcement Learning?

Guozheng Ma, Lu Li, Haoyu Wang, Zixuan Liu, Pierre-Luc Bacon, Dacheng Tao

TL;DR

This paper tackles value learning efficiency in residual RL, where a frozen base policy is refined by a bounded residual, formalized as $a = a_{\text{base}} + \lambda a_{\text{res}}$. It identifies two core bottlenecks: cold-start pathology (the critic lacks knowledge of the value landscape around the base policy) and structural scale mismatch (the residual is overwhelmed by the base action). The authors show that warmup data gathered from the base policy acts as a value anchor and that critic normalization (Layer Normalization) restores sensitivity, while distributional objectives offer no extra gains over standard mean-based value learning. They propose DAWN, combining Data-Anchored Warmup and Normalization, which yields substantial improvements in sample efficiency across ManiSkill and Adroit benchmarks and is robust to different base policies and observation modalities, relying on just two hyperparameters, $M$ and $\lambda$.
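
To make the residual formulation concrete, here is a minimal sketch of composing an action from a frozen base policy and a bounded residual, and of gathering base-policy transitions before residual learning starts. It is an illustration under stated assumptions, not the authors' implementation: the function names (`compose_action`, `collect_warmup`, `env_step`, `base_policy`) are hypothetical, clipping to the action bounds is assumed, and reading $M$ as the number of warmup transitions is an assumption.

```python
import numpy as np


def compose_action(a_base, a_res, lam, low=-1.0, high=1.0):
    """Return a = a_base + lam * a_res, clipped to the action bounds.

    Assumes a_res already lies in [-1, 1] (for example a tanh-squashed
    output), so each dimension is shifted by at most lam before clipping.
    The clipping step itself is an assumption of this sketch.
    """
    a_base = np.asarray(a_base, dtype=np.float64)
    a_res = np.asarray(a_res, dtype=np.float64)
    return np.clip(a_base + lam * a_res, low, high)


def collect_warmup(env_step, base_policy, s0, M):
    """Gather M transitions with the residual fixed to zero.

    env_step(s, a) -> (s_next, reward, done) and base_policy(s) -> a are
    hypothetical callables standing in for an environment and a frozen
    pretrained policy.
    """
    buffer, s = [], s0
    for _ in range(M):
        a = base_policy(s)  # zero residual, so a == a_base
        s_next, r, done = env_step(s, a)
        buffer.append((s, a, r, s_next, done))
        # Assumed reset behaviour: restart from s0 when an episode ends.
        s = s0 if done else s_next
    return buffer
```

With `lam` small, the composed action stays close to the base action, which is the scale mismatch the summary describes: the critic has to resolve value differences caused by a perturbation much smaller than the base action itself.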

Abstract

Residual reinforcement learning (RL) enables stable online refinement of expressive pretrained policies by freezing the base and learning only bounded corrections. However, value learning in residual RL poses unique challenges that remain poorly understood. In this work, we identify two key bottlenecks: cold-start pathology, where the critic lacks knowledge of the value landscape around the base policy, and structural scale mismatch, where the residual contribution is dwarfed by the base action. Through systematic investigation, we uncover the mechanisms underlying these bottlenecks, revealing that simple yet principled solutions suffice: base-policy transitions serve as an essential value anchor for implicit warmup, and critic normalization effectively restores representation sensitivity for discerning value differences. Based on these insights, we propose DAWN (Data-Anchored Warmup and Normalization), a minimal approach targeting efficient value learning in residual RL. By addressing these bottlenecks, DAWN demonstrates substantial efficiency gains across diverse benchmarks, policy architectures, and observation modalities.

Paper Structure

This paper contains 68 sections, 14 equations, 23 figures, 10 tables, and 1 algorithm.

Figures (23)

  • Figure 1: DAWN enables efficient value learning in residual RL. Aggregated success rates with Diffusion Policy as base policy across ManiSkill (3 tasks) and Adroit (3 tasks) benchmarks. DAWN achieves comparable final performance while converging approximately 5× faster than prior methods.
  • Figure 2: Effect of warmup data quantity on learning performance. (Left three) Learning curves across three ManiSkill tasks with varying amounts of warmup data. (Right) Success rate at the midpoint of training versus warmup data quantity. More warmup data consistently improves sample efficiency, with the effect most pronounced on challenging tasks. All experiments use 8 random seeds with shaded regions indicating 95% confidence intervals, a convention we follow throughout the paper.
  • Figure 3: Q-value grounding error during early training. Without warmup data, the error briefly decreases but quickly diverges. With warmup, the critic maintains accurate estimates throughout, confirming the value anchor effect.
  • Figure 4: Effect of explicit value warmup on learning performance. (Left three) Explicit warmup variants fail to improve and often degrade sample efficiency compared to implicit warmup alone. (Right) With automatic entropy tuning, $\alpha$ diverges during the warmup phase across all tasks, even with an initial value as small as $0.01$. Larger initial values lead to more severe divergence (see Appendix).
  • Figure 5: The failure mechanism of explicit value warmup. (Left) Q-value estimates during the warmup phase. Soft Q methods collapse to extreme negative values, while Hard Q remains near the true MC return. (Right) During explicit Soft Q warmup, the magnitude of $|\alpha \log \pi|$ substantially exceeds $|r|$.
  • ...and 18 more figures
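
For context on the quantities named in Figures 4 and 5, the standard soft Bellman target used by maximum-entropy methods such as SAC is written out below, next to a target without the entropy term. Reading "Hard Q" as the latter is an assumption; the captions do not define it. The symbols $\gamma$ (discount factor), $s'$, and $a'$ are the usual ones and do not appear in the captions.

$$
y_{\text{soft}} = r + \gamma \bigl( Q(s', a') - \alpha \log \pi(a' \mid s') \bigr),
\qquad
y_{\text{hard}} = r + \gamma \, Q(s', a'),
\qquad
a' \sim \pi(\cdot \mid s')
$$

When $|\alpha \log \pi|$ is much larger than $|r|$, the entropy term rather than the reward sets the scale of $y_{\text{soft}}$, which is consistent with the Soft Q collapse described in the Figure 5 caption.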