Self-adapting Robotic Agents through Online Continual Reinforcement Learning with World Model Feedback

Fabian Domberg, Georg Schildbach

TL;DR

This work presents a framework for online Continual Reinforcement Learning that enables automated adaptation during deployment. It sketches how autonomous robotic agents could one day move beyond static training regimes toward adaptive systems capable of self-reflection and self-improvement during operation, much like their biological counterparts.

Abstract

Because learning-based robotic controllers are typically trained offline and deployed with fixed parameters, their ability to cope with unforeseen changes during operation is limited. Drawing on biological inspiration, this work presents a framework for online Continual Reinforcement Learning that enables automated adaptation during deployment. Building on DreamerV3, a model-based Reinforcement Learning algorithm, the proposed method leverages world model prediction residuals to detect out-of-distribution events and automatically trigger finetuning. Adaptation progress is monitored using both task-level performance signals and internal training metrics, allowing convergence to be assessed without external supervision or domain knowledge. The approach is validated on a variety of contemporary continuous control problems, including a quadruped robot in high-fidelity simulation and a real-world model vehicle. Relevant metrics and their interpretation are presented and discussed, along with the resulting trade-offs. The results sketch how autonomous robotic agents could one day move beyond static training regimes toward adaptive systems capable of self-reflection and self-improvement during operation, much like their biological counterparts.
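To make the detect-then-finetune idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes a scalar prediction residual is available per step, flags a possible out-of-distribution event when a windowed mean of residuals exceeds a baseline mean plus k standard deviations, and treats adaptation as converged when a monitored signal plateaus. The class names `ResidualTrigger` and `PlateauMonitor`, the thresholding rule, and all window sizes and tolerances are hypothetical choices made for illustration; the paper's actual criteria may differ.

```python
from collections import deque
from statistics import mean, stdev


class ResidualTrigger:
    """Hypothetical detector: flags a possible out-of-distribution event
    when recent residuals rise well above a nominal baseline."""

    def __init__(self, baseline, window=50, k=3.0):
        # baseline: residuals recorded while the agent behaved nominally.
        # Assumed rule: threshold = baseline mean + k * baseline std.
        self.threshold = mean(baseline) + k * stdev(baseline)
        self.recent = deque(maxlen=window)

    def update(self, residual):
        """Add one residual; return True if finetuning should be triggered."""
        self.recent.append(residual)
        # Decide only once the window is full, so that a single
        # noisy step does not trigger finetuning on its own.
        if len(self.recent) < self.recent.maxlen:
            return False
        return mean(self.recent) > self.threshold


class PlateauMonitor:
    """Hypothetical convergence check: compares the mean of the newest
    window of a signal (e.g. reward or a training loss) with the mean
    of the window before it."""

    def __init__(self, window=100, tol=0.01):
        self.window = window
        self.tol = tol
        self.values = deque(maxlen=2 * window)

    def update(self, value):
        """Add one value; return True once the signal has plateaued."""
        self.values.append(value)
        if len(self.values) < 2 * self.window:
            return False
        vals = list(self.values)
        old = mean(vals[: self.window])
        new = mean(vals[self.window:])
        # Relative change between the two windows must be within tol.
        return abs(new - old) <= self.tol * max(abs(old), 1e-8)
```

In a deployment loop under these assumptions, `ResidualTrigger.update` would be called once per environment step, and `PlateauMonitor.update` only while finetuning is active. Averaging over a window trades detection latency for robustness to isolated residual spikes.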

Paper Structure (16 sections, 7 equations, 3 figures)

Figures (3)

  • Figure 1: Experiment on DMC's Walker Walk problem. At $5,000$ steps a random joint’s gear ratio is reduced, causing the Walker to lose its balance repeatedly. The proposed method detects the change and initiates finetuning, leading to recovery. Averaged over ten runs.
  • Figure 2: Simulated experiment using quadruped robot ANYmal, averaged over nine runs. At $9,000$ steps, velocity limits of the right hind leg actuators are reduced, resulting in unstable locomotion and decreased reward. Our method detects the change and initiates finetuning, restoring stable walking. A failed run additionally illustrates non-convergent behavior.
  • Figure 3: Experiment moving a trained model from simulation to a real $1$:$10$ scale vehicle. At $10,000$ steps, the policy is transferred to the car, leading to a surge in prediction residuals and decreased reward. Finetuning stabilizes behavior, and performance slowly recovers. After $50,000$ steps, rear-wheel friction is reduced, causing a second adaptation phase and subsequent recovery.