Adapting Image-based RL Policies via Predicted Rewards

Weiyao Wang, Xinyuan Fang, Gregory D. Hager

TL;DR

The paper tackles generalization gaps in image-based reinforcement learning under visual domain shift where trajectory-level errors propagate. It introduces Predicted Reward Fine-Tuning (PRFT), which jointly learns a policy π and a reward predictor r_hat from observations, and then fine-tunes π in the target domain using predicted rewards under a MaxEnt objective; the predictor is frozen during deployment. Under domain shift, predicted rewards often follow a benign linear transform r_hat ≈ k r + b with k>0, preserving the optimal policy. Empirical results on six DMControl variants and sim-to-real transfer for a UR-5 robot show substantial improvements over strong baselines, indicating that imperfect reward signals can effectively guide domain adaptation in image-based RL.
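The claim that a transform r_hat ≈ k r + b with k > 0 preserves the optimal policy can be checked on a toy problem. For the standard discounted return, replacing r with k r + b gives J_hat(π) = k J(π) + b / (1 - γ), so the ranking of policies is unchanged whenever k > 0. Under a MaxEnt objective the scale k additionally acts like a rescaling of the entropy temperature, which the sketch below does not model. The sketch is an illustration only, not the paper's code: the random tabular MDP, the helper names `random_mdp` and `greedy_policy`, and the values of `k` and `b` are all assumptions made for the example.

```python
import random


def greedy_policy(P, R, gamma=0.9, iters=500):
    """Run value iteration on a tabular MDP and return the greedy action per state.

    P[s][a] is a list of next-state probabilities; R[s][a] is the reward.
    """
    n_states, n_actions = len(R), len(R[0])
    V = [0.0] * n_states
    for _ in range(iters):
        # Bellman optimality backup for every state-action pair.
        Q = [
            [
                R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], V))
                for a in range(n_actions)
            ]
            for s in range(n_states)
        ]
        V = [max(q) for q in Q]
    return [q.index(max(q)) for q in Q]


def random_mdp(n_states, n_actions, seed=0):
    """Build a hypothetical random MDP (not one of the paper's environments)."""
    rng = random.Random(seed)
    P = []
    for _ in range(n_states):
        per_action = []
        for _ in range(n_actions):
            weights = [rng.random() for _ in range(n_states)]
            total = sum(weights)
            per_action.append([w / total for w in weights])
        P.append(per_action)
    R = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
    return P, R


P, R = random_mdp(n_states=6, n_actions=3)

# Assumed distortion of the reward signal: positive scale, arbitrary offset.
k, b = 0.4, -2.0
R_hat = [[k * r + b for r in row] for row in R]

pi_true = greedy_policy(P, R)
pi_pred = greedy_policy(P, R_hat)
print("same greedy policy:", pi_true == pi_pred)
```

The comparison only covers the exactly linear case. The TL;DR says predicted rewards "often" follow such a transform, so a real predictor would add noise on top of it.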

Abstract

Image-based reinforcement learning (RL) faces significant challenges in generalization when the visual environment undergoes substantial changes between training and deployment. Under such circumstances, learned policies may perform poorly. Previous approaches to this problem have largely focused on broadening the training observation distribution, employing techniques like data augmentation and domain randomization. However, given the sequential nature of the RL decision-making problem, residual errors are often propagated by the learned policy model and accumulate throughout the trajectory, resulting in highly degraded performance. In this paper, we leverage the observation that predicted rewards under domain shift, even though imperfect, can still be a useful signal to guide fine-tuning. We exploit this property to fine-tune a policy using reward prediction in the target domain. We have found that, even under significant domain shift, the predicted reward can still provide a meaningful signal, and fine-tuning substantially improves the original policy. Our approach, termed Predicted Reward Fine-tuning (PRFT), improves performance across diverse tasks in both simulated benchmarks and real-world experiments. More information is available on the project web page: https://sites.google.com/view/prft.
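The deployment stage described above, fine-tuning with predicted rather than ground-truth rewards, can be sketched as a relabeling step applied to target-domain transitions before they reach the RL update. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Transition` container, the `relabel` helper, the toy predictor, and the choice of (observation, action, next observation) as predictor inputs are all hypothetical. The RL update itself is omitted.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Transition:
    """Hypothetical transition tuple; the reward is unknown in the target domain."""
    obs: Sequence[float]
    action: int
    next_obs: Sequence[float]
    reward: Optional[float] = None


# Assumed predictor signature: (obs, action, next_obs) -> predicted reward.
RewardPredictor = Callable[[Sequence[float], int, Sequence[float]], float]


def relabel(batch: List[Transition], predictor: RewardPredictor) -> List[Transition]:
    """Fill in rewards with a frozen predictor; the predictor is never updated here."""
    return [
        replace(t, reward=predictor(t.obs, t.action, t.next_obs))
        for t in batch
    ]


def toy_predictor(obs, action, next_obs):
    """Stand-in for a trained, frozen reward model (purely illustrative)."""
    return 0.5 * sum(next_obs) - 0.1 * action


target_batch = [
    Transition(obs=[0.0, 1.0], action=1, next_obs=[0.2, 0.8]),
    Transition(obs=[0.2, 0.8], action=0, next_obs=[0.4, 0.6]),
]

relabeled = relabel(target_batch, toy_predictor)
for t in relabeled:
    print(t.action, t.reward)
```

In a full pipeline, the relabeled batch would be handed to whatever policy-update routine was used during source-domain training.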

Paper Structure (7 sections, 4 equations, 6 figures, 2 tables, 1 algorithm)

Figures (6)

  • Figure 1: Left: Examples of source environment observations and target environment observations. Right: Illustration of the effect of domain shift on reward prediction. Samples are collected using a policy trained on the source walker walk environment, and the domain shift effect is tested by evaluating predicted rewards in both the source and the target environment (video_hard in DMControl GB) with the same underlying states. Linear regression fits of the predicted rewards against the ground-truth rewards, for both the source and the target environment, are plotted for visualization.
  • Figure 2: Left: During training, we optimize the reward prediction module along with reinforcement learning using transition tuples sampled from the replay buffer. Right: During deployment fine-tuning, we use the transition tuples with predicted rewards to fine-tune the reinforcement learning policy. The reward prediction module is frozen in this stage.
  • Figure 3: Samples from the DeepMind Control Suite (DMControl), the DeepMind Control Generalization Benchmark (DMControl GB) with video backgrounds (easy and hard), and the Distracting Control Suite (Distracting CS) with intensities from 0.1 to 0.5.
  • Figure 4: Top: Policies improve during fine-tuning using predicted rewards. Average episodic rewards over four tasks and four random seeds are plotted. Bottom: Relative improvement of average rewards across different distraction intensities at 10K and 50K fine-tuning steps.
  • Figure 5: Evaluation in Distracting Control Suite environments with varying distraction intensities. Our method significantly outperforms baseline methods in five out of six environments. Error bars show one standard deviation.
  • ...and 1 more figure
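The linear fit described in the Figure 1 caption, predicted reward regressed against ground-truth reward, can be reproduced in outline with ordinary least squares. The sketch below is an illustration under stated assumptions: the data are synthetic, generated from a made-up slope, offset, and noise level, and are not results from the paper. The helper `fit_line` is hypothetical.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y ≈ k * x + b; returns (k, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    k = sxy / sxx
    return k, mean_y - k * mean_x


rng = random.Random(0)

# Synthetic stand-ins: "ground-truth" rewards and a distorted, noisy prediction.
true_rewards = [rng.random() for _ in range(200)]
pred_rewards = [0.5 * r + 0.1 + rng.gauss(0.0, 0.02) for r in true_rewards]

k_hat, b_hat = fit_line(true_rewards, pred_rewards)
print(f"k_hat={k_hat:.3f}, b_hat={b_hat:.3f}, positive slope: {k_hat > 0}")
```

A positive fitted slope is the property of interest here, since it is the condition k > 0 under which the predicted reward ranks outcomes in the same order as the true reward.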