World Models as Reference Trajectories for Rapid Motor Adaptation

Carlos Stein Brito, Daniel McNamee

TL;DR

The paper tackles the problem of sustaining performance when real-world dynamics change by introducing Reflexive World Models (RWM), a dual-control framework that uses world-model predictions as implicit reference trajectories for rapid adaptation. A base RL policy operates in a learned latent space to maximize long-term reward, while a lightweight adaptive controller uses forward-model predictions to track those references, providing fast error correction with low online cost. The authors derive control-theoretic guarantees linking world-model accuracy, control authority, and external perturbations to bounded error and value performance, and demonstrate robust adaptation across high-dimensional locomotion tasks under actuator perturbations. Empirical results show that RWM achieves faster adaptation and higher robustness than model-based RL baselines and domain-randomized pre-training, illustrating a principled bridge between adaptive control and modern RL for reliable real-world deployment.

Abstract

Deploying learned control policies in real-world environments poses a fundamental challenge. When system dynamics change unexpectedly, performance degrades until models are retrained on new data. We introduce Reflexive World Models (RWM), a dual control framework that uses world model predictions as implicit reference trajectories for rapid adaptation. Our method separates the control problem into long-term reward maximization through reinforcement learning and robust motor execution through rapid latent control. This dual architecture achieves significantly faster adaptation with low online computational cost compared to model-based RL baselines, while maintaining near-optimal performance. The approach combines the benefits of flexible policy learning through reinforcement learning with rapid error correction capabilities, providing a principled approach to maintaining performance in high-dimensional continuous control tasks under varying dynamics.

Paper Structure

This paper contains 21 sections, 4 theorems, 18 equations, 5 figures, 2 tables, 1 algorithm.

Key Result

Theorem 4.2

Under the system-properties assumption, the control law $a_c = -\eta(\partial F/\partial a)^T e(t)$ achieves a bounded control error, with rate constant $\gamma = (1 - \eta\alpha^2 + \eta L^2) < 1$ for $\eta < 1/L^2$.
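
To make the control law in Theorem 4.2 concrete, the following is a minimal sketch on a toy linear system, not the authors' implementation. It assumes the error $e$ is the actual latent displacement minus the displacement predicted under the base action, and that the forward model is linear in the action, so $\partial F/\partial a$ is a constant matrix `B`. The matrix `B`, the actuator gain, the base action, and the step size `eta` are all hypothetical values chosen for illustration.

```python
def corrective_action(jacobian, error, eta):
    """Return a_c = -eta * J^T e, where jacobian[i][j] = dF_i / da_j."""
    n_state, n_act = len(jacobian), len(jacobian[0])
    return [
        -eta * sum(jacobian[i][j] * error[i] for i in range(n_state))
        for j in range(n_act)
    ]


def matvec(mat, vec):
    """Plain matrix-vector product for nested lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def tracking_error(B, gain, a_base, a_corr):
    """Actual minus predicted displacement for the toy linear model.

    Assumption: the perturbation scales every actuator command by `gain`,
    while the forward model still predicts the unperturbed outcome B @ a_base.
    """
    applied = [gain * (a0 + ac) for a0, ac in zip(a_base, a_corr)]
    actual = matvec(B, applied)
    predicted = matvec(B, a_base)
    return [x - y for x, y in zip(actual, predicted)]


def norm(v):
    return sum(x * x for x in v) ** 0.5


# Hypothetical toy values (not taken from the paper).
B = [[1.0, 0.0], [0.0, 2.0]]  # model Jacobian dF/da
a_base = [1.0, -0.5]          # base policy action a_0
gain = 0.5                    # actuator perturbation
eta = 0.2                     # step size, below 1 / L^2 with L = 2

e_before = tracking_error(B, gain, a_base, [0.0, 0.0])
a_c = corrective_action(B, e_before, eta)
e_after = tracking_error(B, gain, a_base, a_c)

print("error norm before correction:", norm(e_before))
print("error norm after correction: ", norm(e_after))
```

In this sketch the controller uses the model Jacobian `B` even though the perturbed plant responds with `gain * B`; the correction still shrinks the error because the two share a descent direction. A single proportional step is shown, whereas a deployed controller would recompute the error and correction at every time step.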

Figures (5)

  • Figure 1: (A) Network architecture of Reflexive World Models (RWM), showing the reinforcement learning policy (blue) and adaptive control modules (green), with interface variables in orange. Each transformation is implemented as a two-hidden-layer MLP. (B) Illustrative simulation of the adaptive control mechanism for a 2D pointmass task (without encoder, $z = s$). When actuators are perturbed, the trajectory deviates from the predicted future states $\hat{s}_{t+k}$ under the base policy actions $a_0$. This error triggers an update to generate corrective actions $a_c$. (C) Under alternating directional perturbations (red), RWM corrects deviations from the optimal trajectory, exhibiting characteristic after-effects when perturbations are removed.
  • Figure 2: RWM adaptation performance under step motor perturbations. The plots show Reward and Control Error over 800 episodes (left column); shaded areas indicate perturbation periods. The right column displays Normalized Median Reward and Control Error, aggregated across perturbation cycles. Shaded areas in the right plots represent the 95% confidence interval of the median (bootstrapped). RWM (orange line) consistently maintains higher reward and lower control error compared to No Adaptation (blue line) and the TD-MPC2 baseline (green line), demonstrating effective and rapid recovery from perturbations.
  • Figure 3: Nonstationary perturbations and high-dimensional coordination. (A) Walker2D under continuous filtered noise perturbations to actuator gains, following a sinusoidal pattern $p$ (top). Time series (left) and averaged performance (right) show RWM achieves the highest reward (360.56), followed by TD-MPC2 (311.67), with No Adaptation performing worst (233.42). Control error measurements (bottom) demonstrate that RWM maintains systematically lower error throughout adaptation compared to both alternatives. (B) Analysis of the 17-actuator Humanoid environment showing coordinated movement patterns maintained by RWM even under perturbations.
  • Figure 4: Addressing challenges in baseline policy actions for effective adaptation. (A) A humanoid agent exhibiting "dead" behavior due to a simple quadratic action cost in its RL objective, leading to inaction. (B) The norm of actions for the simulation in (A), demonstrating a decay towards zero over training episodes as the agent minimizes the naive action cost. (C) Action component values over time for a standard TD-MPC2 policy (without the thresholded cost) in the Humanoid task, showing frequent saturation at the boundaries [-1, 1], which impedes gradient flow for the adaptive controller. (D) Smoother and bounded action values from a TD-MPC2 policy trained with the proposed thresholded quadratic action cost, maintaining differentiability and responsiveness.
  • Figure 5: Comparison with Domain-Randomized Baselines: Impact of pre-training with perturbations. Comparison of No Adaptation, RWM, and TD-MPC2 when baseline policies for No Adaptation and TD-MPC2 are pre-trained with exposure to actuator perturbations. (Left column) Reward and Control Error over episodes. (Right column) Normalized median reward and control error within perturbation cycles. While pre-training with perturbations improves the baseline, RWM (orange line) still demonstrates superior adaptation capabilities in terms of reward and control error compared to the pre-trained TD-MPC2 (green line) and the pre-trained No Adaptation policy (blue line).
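
Figures 2 and 5 report a bootstrapped 95% confidence interval of the median. The sketch below shows one common way to compute such an interval, the percentile bootstrap; the source does not say which bootstrap variant or how many resamples the authors used, so the function name, the resample count, and the reward values are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_median_ci(samples, n_boot=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the median.

    Resamples `samples` with replacement `n_boot` times, takes the median
    of each resample, and returns the empirical quantiles that bracket
    the central `level` mass of those medians.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)
    medians = sorted(
        statistics.median(rng.choices(samples, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return medians[lo_idx], medians[hi_idx]


# Synthetic per-episode rewards, invented purely for this example.
rewards = [float(r) for r in range(1, 22)]
lower, upper = bootstrap_median_ci(rewards)
print("median:", statistics.median(rewards))
print("95% CI of the median:", (lower, upper))
```

The percentile method is the simplest choice; bias-corrected variants exist and may give tighter intervals on skewed reward distributions.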

Theorems & Definitions (6)

  • Theorem 4.2: Control Error
  • Theorem 4.3: Value Bounds
  • Theorem D.1: Control Error Bounds
  • proof
  • Theorem D.2: Performance Guarantees
  • proof