ProgAgent: A Continual RL Agent with Progress-Aware Rewards

Jinzhou Tan, Gabriel Adineera, Jinoh Kim

TL;DR

ProgAgent is a continual reinforcement learning (CRL) agent that unifies progress-aware reward learning with a high-throughput, JAX-native system architecture, and incorporates an adversarial push-back refinement that regularizes the reward model, curbing overconfident predictions on non-expert trajectories and countering distribution shift.

Abstract

We present ProgAgent, a continual reinforcement learning (CRL) agent that unifies progress-aware reward learning with a high-throughput, JAX-native system architecture. Lifelong robotic learning grapples with catastrophic forgetting and the high cost of reward specification. ProgAgent tackles both by deriving dense, shaped rewards from unlabeled expert videos through a perceptual model that estimates task progress from initial, current, and goal observations. We theoretically interpret this estimate as a learned state-potential function, delivering robust guidance in line with expert behaviors. To maintain stability amid online exploration, where novel, out-of-distribution states arise, we incorporate an adversarial push-back refinement that regularizes the reward model, curbing overconfident predictions on non-expert trajectories and countering distribution shift. By embedding this reward mechanism into a JIT-compiled loop, ProgAgent supports massively parallel rollouts and fully differentiable updates, making a sophisticated unified objective feasible: it merges PPO with coreset replay and synaptic intelligence for an improved stability-plasticity balance. Evaluations on the ContinualBench and Meta-World benchmarks highlight ProgAgent's advantages: it markedly reduces forgetting, speeds up learning, and outperforms key baselines in visual reward learning (e.g., Rank2Reward, TCN) and continual learning (e.g., Coreset, SI), surpassing even an idealized perfect-memory agent. Real-robot trials further validate its ability to acquire complex manipulation skills from noisy, few-shot human demonstrations.
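
The state-potential reading of the progress estimate can be illustrated with a small JAX sketch. It assumes the standard potential-based shaping form $r_t = \gamma\,\Phi(s_{t+1}) - \Phi(s_t)$, which the abstract does not spell out; the function name `shaped_rewards`, the discount `gamma`, and the hand-written progress values are hypothetical stand-ins for the output of the perceptual model.

```python
import jax
import jax.numpy as jnp


def shaped_rewards(progress, gamma):
    """Potential-based shaping: r_t = gamma * Phi(s_{t+1}) - Phi(s_t).

    progress: shape (T + 1,), assumed progress estimates Phi(s_0..s_T)
    for one rollout (a stand-in for the learned perceptual model).
    """
    phi_t = progress[:-1]      # Phi(s_t)
    phi_next = progress[1:]    # Phi(s_{t+1})
    return gamma * phi_next - phi_t


# vmap over a batch of parallel rollouts and JIT-compile the result.
batched_shaped_rewards = jax.jit(jax.vmap(shaped_rewards, in_axes=(0, None)))

# Two toy rollouts: one reaches the goal, one advances and then regresses.
progress = jnp.array([[0.0, 0.25, 0.5, 1.0],
                      [0.0, 0.25, 0.25, 0.0]])
rewards = batched_shaped_rewards(progress, 1.0)
returns = rewards.sum(axis=1)  # with gamma = 1 this telescopes to Phi(s_T) - Phi(s_0)
```

With `gamma = 1.0` the per-step rewards sum to the net change in estimated progress, so under this assumed form a rollout that drifts away from the goal earns no cumulative shaped reward.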

Paper Structure

This paper contains 23 sections, 8 equations, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: Overview of ProgAgent's training pipeline, illustrating the unified loop of reward learning, policy optimization, and continual mechanisms under JAX acceleration.
  • Figure 2: The complete training pipeline of ProgAgent. Our approach utilizes two parallel data streams. Top path: An offline stream uses unlabeled expert videos to train a Progress Prediction Network. The network's output is optimized via a contrastive expert loss ($L_{\text{expert}}$) to accurately estimate task progress. Bottom path: An online stream collects agent experience through large-scale parallel rollouts. To ensure robustness against out-of-distribution states, an adversarial push-back loss ($L_{\text{push}}$) regularizes the reward function by pushing predictions on these non-expert trajectories toward a low-confidence (zero-mean, high-variance) prior. These two objectives are combined into a final loss, $L_{\text{total}}$, to produce a dense and stable reward signal for policy training.
  • Figure 3: Qualitative comparison of the learned policies for the Button-Press task in the ContinualBench environment. The sequences display each agent's behavior at the end of training. From top to bottom: a standard Finetuning agent, the Online Agent, and our ProgAgent. ProgAgent successfully learns a direct and stable policy to complete the task, whereas the baseline methods exhibit suboptimal or failed behaviors.
  • Figure 4: Comparative performance of ProgAgent against baseline methods on the final evaluation reward after sequential training on three tasks from ContinualBench: button-press, door-open, and window-close. Bars indicate the mean reward. ProgAgent consistently outperforms all other methods across all tasks.
  • Figure 5: Learning curves of reward versus training steps on ContinualBench tasks. Shaded regions indicate standard deviation. ProgAgent (yellow) shows consistently faster learning and achieves higher final rewards across all tasks, demonstrating its superior sample efficiency.
  • ...and 2 more figures
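
The push-back idea in the Figure 2 caption can be sketched in JAX as follows. This is only an illustration under stated assumptions, not the paper's loss: it assumes the progress network outputs a Gaussian (mean and log-variance), uses a KL divergence to a zero-mean prior as the push-back term, and combines the two objectives as a weighted sum. The names `push_back_loss`, `total_loss`, `prior_var`, and `push_weight` are hypothetical, and $L_{\text{expert}}$ is passed in as a precomputed value because the caption does not give its form.

```python
import jax
import jax.numpy as jnp


def push_back_loss(mean, log_var, prior_var=1.0):
    """Mean KL( N(mean, exp(log_var)) || N(0, prior_var) ) over agent states.

    Assumed form of L_push: penalizes confident (non-zero mean, small variance)
    progress predictions on non-expert trajectories.
    """
    var = jnp.exp(log_var)
    kl = 0.5 * (jnp.log(prior_var) - log_var + (var + mean ** 2) / prior_var - 1.0)
    return jnp.mean(kl)


def total_loss(l_expert, agent_mean, agent_log_var, push_weight=0.1):
    """Assumed combination: L_total = L_expert + push_weight * L_push."""
    return l_expert + push_weight * push_back_loss(agent_mean, agent_log_var)


# Overconfident predictions on agent states: high mean, tiny variance.
agent_mean = jnp.array([0.9, 0.8])
agent_log_var = jnp.array([-4.0, -4.0])
overconfident = push_back_loss(agent_mean, agent_log_var)

# Predictions that already match the prior incur no penalty.
at_prior = push_back_loss(jnp.zeros(2), jnp.zeros(2))

# The gradient w.r.t. the mean is positive here, so a descent step lowers the mean.
grad_mean = jax.grad(push_back_loss)(agent_mean, agent_log_var)
```

In this sketch a larger `prior_var` makes the prior less confident, which weakens the pull on the mean and rewards wider predictive variance on agent-collected states.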