Adapting Image-based RL Policies via Predicted Rewards
Weiyao Wang, Xinyuan Fang, Gregory D. Hager
TL;DR
The paper tackles generalization gaps in image-based reinforcement learning under visual domain shift, where residual errors propagate and accumulate along the trajectory. It introduces Predicted Reward Fine-Tuning (PRFT), which jointly learns a policy π and a reward predictor r_hat from observations, and then fine-tunes π in the target domain using predicted rewards under a MaxEnt objective; the predictor is frozen during deployment. Under domain shift, predicted rewards often follow a benign linear transform r_hat ≈ k r + b with k > 0, which preserves the optimal policy. Empirical results on six DMControl variants and sim-to-real transfer for a UR-5 robot show substantial improvements over strong baselines, indicating that imperfect reward signals can effectively guide domain adaptation in image-based RL.
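The claim that a transform r_hat ≈ k r + b with k > 0 leaves the optimal policy unchanged can be checked numerically on a toy problem. The sketch below is illustrative only and is not the paper's method: it assumes a small random tabular MDP with no terminal states, and uses plain discounted value iteration rather than the MaxEnt objective or image observations. The names `value_iteration`, `P`, `R`, `k`, and `b` are hypothetical.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, iters=500):
    """Greedy policy and Q-values for a tabular MDP.

    P: (S, A, S) transition probabilities, R: (S, A) rewards.
    """
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)        # state values under the greedy policy
        Q = R + gamma * (P @ V)  # Bellman optimality backup, shape (S, A)
    return Q.argmax(axis=1), Q

# Assumed toy MDP: 5 states, 3 actions, random dynamics and rewards.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)  # normalize rows into distributions
R = rng.random((S, A))

# "Predicted" reward: a positive affine transform of the true reward.
k, b = 0.3, -2.0
R_hat = k * R + b

pi_true, Q_true = value_iteration(P, R, gamma)
pi_hat, Q_hat = value_iteration(P, R_hat, gamma)

same_policy = bool(np.array_equal(pi_true, pi_hat))
# Q-values transform as Q_hat = k * Q + b / (1 - gamma), so the argmax is kept.
affine_q = bool(np.allclose(Q_hat, k * Q_true + b / (1 - gamma)))
print(same_policy, affine_q)
```

In this setting the scale k rescales every Q-value and the offset b adds the same constant b / (1 - gamma) to all of them, so the ranking of actions in each state is unchanged. With k < 0 the ranking would flip, which is why the sign condition matters.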
Abstract
Image-based reinforcement learning (RL) faces significant challenges in generalization when the visual environment undergoes substantial changes between training and deployment. Under such circumstances, learned policies may perform poorly. Previous approaches to this problem have largely focused on broadening the training observation distribution, employing techniques like data augmentation and domain randomization. However, given the sequential nature of the RL decision-making problem, it is often the case that residual errors are propagated by the learned policy model and accumulate throughout the trajectory, resulting in highly degraded performance. In this paper, we leverage the observation that predicted rewards under domain shift, even though imperfect, can still be a useful signal to guide fine-tuning. We exploit this property to fine-tune a policy using reward prediction in the target domain. We have found that, even under significant domain shift, the predicted reward can still provide a meaningful signal, and fine-tuning substantially improves the original policy. Our approach, termed Predicted Reward Fine-tuning (PRFT), improves performance across diverse tasks in both simulated benchmarks and real-world experiments. More information is available on the project web page: https://sites.google.com/view/prft.
