AdaWorldPolicy: World-Model-Driven Diffusion Policy with Online Adaptive Learning for Robotic Manipulation

Ge Yuan, Qiyuan Qiao, Jing Zhang, Dong Xu

TL;DR

This work introduces AdaWorldPolicy (World-Model-Driven Diffusion Policy with Online Adaptive Learning), a unified framework that improves robotic manipulation under dynamic conditions with minimal human involvement and adapts online to out-of-distribution scenarios.

Abstract

Effective robotic manipulation requires policies that can anticipate physical outcomes and adapt to real-world environments. In this work, we introduce a unified framework, World-Model-Driven Diffusion Policy with Online Adaptive Learning (AdaWorldPolicy), to enhance robotic manipulation under dynamic conditions with minimal human involvement. Our core insight is that world models provide strong supervision signals, enabling online adaptive learning in dynamic environments, which can be complemented by force-torque feedback to mitigate dynamic force shifts. AdaWorldPolicy integrates a world model, an action expert, and a force predictor, all implemented as Flow Matching Diffusion Transformers (DiT). The three modules are interconnected via multi-modal self-attention layers, enabling deep feature exchange for joint learning while preserving their modularity. We further propose a novel Online Adaptive Learning (AdaOL) strategy that dynamically switches between an Action Generation mode and a Future Imagination mode to drive reactive updates across all three modules. This creates a closed-loop mechanism that adapts to both visual and physical domain shifts with minimal overhead. Across a suite of simulated and real-robot benchmarks, AdaWorldPolicy achieves state-of-the-art performance and adapts dynamically to out-of-distribution scenarios.
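The closed loop described in the abstract (act, imagine the outcome, compare with what actually happened, update) can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: the scalar dynamics, the names `imagine`, `policy`, and `run_episode`, the learning rate, and the single `adapter` parameter (standing in for a small set of shared parameters updated online) are all assumptions made for illustration.

```python
def imagine(obs, action, base_gain, adapter):
    """Future Imagination stand-in: predict the next observation from (obs, action)."""
    return obs + (base_gain + adapter) * action


def policy(obs, goal, base_gain, adapter):
    """Action Generation stand-in: pick the action the shared model believes
    reaches the goal, clipped to [-1, 1]."""
    gain = max(1e-3, base_gain + adapter)  # guard against division by ~0
    return max(-1.0, min(1.0, (goal - obs) / gain))


def run_episode(true_gain, base_gain=1.0, lr=0.5, steps=30, goal=5.0):
    """Toy online-adaptation loop under a shift in the environment's gain.

    base_gain is the frozen 'offline' belief; adapter is the only value
    updated online (hypothetical stand-in for a small adaptable subset).
    """
    adapter = 0.0
    obs = 0.0
    errors = []
    for _ in range(steps):
        action = policy(obs, goal, base_gain, adapter)         # generate an action
        predicted = imagine(obs, action, base_gain, adapter)   # imagine its outcome
        next_obs = obs + true_gain * action                    # real environment feedback
        residual = predicted - next_obs
        errors.append(residual ** 2)                           # squared prediction loss
        # Gradient of residual**2 with respect to adapter is 2 * residual * action.
        adapter -= lr * 2.0 * residual * action
        obs = next_obs
    return adapter, errors, obs


adapter, errors, final_obs = run_episode(true_gain=0.6)
print(f"learned adapter: {adapter:.3f}, final observation: {final_obs:.3f}")
```

Because the policy and the predictor share `adapter` in this sketch, correcting the prediction error also corrects the action choice, which is the intuition behind using the world model's prediction loss as an online supervision signal.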

Paper Structure (45 sections, 2 equations, 7 figures, 6 tables)

This paper contains 45 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: An overview of our AdaWorldPolicy with Online Adaptive Learning (AdaOL). At timestep $t$, our AdaWorldPolicy network operates in two modes. Mode I (Action Generation): The AdaWorldPolicy network acts as an action policy generator $P^{\text{policy}}(a|o)$, which takes the current multi-modal observation $o$ (from static/gripper cameras and force sensor) to generate an action $a$. This action is then executed by the robot. During offline training, this step is supervised by the imitation loss $\mathcal{L}_1$ (see Section \ref{sec:model_and_offline_training}). Mode II (Future Imagination): Subsequently, our AdaWorldPolicy network turns into an action-conditioned world model $P^{\text{imagine}}(o'|o, a)$, which takes the same observation $o$ and the executed action $a$ to predict an Imagined Observation at timestep $t+1$. The core of our AdaOL strategy lies in the online updating loop (red arrows). The discrepancy between the Imagined Observation and the real Observation at timestep $t+1$ (e.g., under in-domain setup or under domain shifts like lighting or pose variations) is quantified by a prediction loss $\mathcal{L}_2$. This loss drives an online update to a small subset of shared network parameters, creating a closed-loop system that continuously adapts to real-world dynamics.
  • Figure 2: Network architecture and workflow of AdaWorldPolicy. Our unified multi-modal framework builds upon a shared multi-modal transformer backbone. It synergistically integrates three modules: a World Model for visual prediction, a Force Predictor for physical dynamics modeling, and an Action Model for policy generation. All modules are implemented as Flow Matching Diffusion Transformers (DiT) and interact through a shared Multi-modal Self-attention layer. Input modalities (vision, action, force, text) are first encoded, conditioned with global features (text, state, noise level) via the adaLN module, and then processed by the shared multi-modal self-attention layer. In our framework, the operational mode is determined by a switch on the action input: in Mode I (Action Generation), the action token is provided as noise for the model to generate an action; in Mode II (Future Imagination), a known action is provided as a condition for future prediction. A LoRA-based mechanism enables efficient online updates of a small set of parameters.
  • Figure 3: Input and output details of our unified model in two different modes. Mode I (Action Generation): The model takes an observation history $o$ (e.g., context length $T_c=5$) and predicts a future action sequence $\{a_t, a_{t+1}, \cdots, a_{t+T_a}\}$ (e.g., action horizon $T_a=8$). At test time, the robot executes this predicted action sequence. Mode II (Future Imagination): The model is conditioned on both the observation history $o$ and a ground-truth action sequence $a$, then predicts the corresponding future observation sequence $\hat{o}'$. The discrepancy between this prediction and real environmental feedback is used to update the network parameters during our Adaptive Online Learning (AdaOL) phase.
  • Figure 4: Visualizations of the simulated benchmarks used in our experiments: Variant PushT for out-of-distribution robustness, LIBERO for long-horizon skills, and the CALVIN benchmark for language-conditioned tasks across different domains.
  • Figure 5: Overview of our real-robot evaluation. Our experimental setup (center) features an INOVO robotic arm with multi-modal sensing capabilities, including gripper/static cameras and a force sensor. We test our AdaWorldPolicy on four diverse manipulation tasks (left), such as sweeping beans and placing eggs. To specifically evaluate the effectiveness of our AdaOL strategy, we introduce a variety of challenging out-of-distribution (OOD) shifts during execution (right). These include visual perturbations like drastic lighting and texture changes, as well as physical perturbations like swapping objects and altering the workspace geometry (e.g., tilting the whiteboard).
  • ...and 2 more figures
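
The mode switch on the action input described in the captions for Figures 2 and 3 can be sketched as a small input-building routine. This is a hedged illustration only: `ModelInput`, `build_input`, the action dimensionality of 7, and the choice to count the horizon as exactly $T_a$ action steps are assumptions, not details confirmed by the paper. Only the example values $T_c=5$ and $T_a=8$ and the noise-versus-condition behaviour come from the captions.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

CONTEXT_LEN = 5      # T_c, example value from the Figure 3 caption
ACTION_HORIZON = 8   # T_a, example value from the Figure 3 caption
ACTION_DIM = 7       # hypothetical action dimensionality

Sequence2D = List[List[float]]


@dataclass
class ModelInput:
    mode: str                  # "action_generation" or "future_imagination"
    obs_history: Sequence2D    # CONTEXT_LEN observation feature vectors
    action_tokens: Sequence2D  # ACTION_HORIZON action vectors
    action_is_noise: bool      # True when the model must generate the actions


def build_input(obs_history: Sequence2D,
                known_actions: Optional[Sequence2D] = None,
                seed: int = 0) -> ModelInput:
    """Select the operating mode from what is supplied on the action input."""
    if len(obs_history) != CONTEXT_LEN:
        raise ValueError(f"expected {CONTEXT_LEN} observations, got {len(obs_history)}")

    if known_actions is None:
        # Mode I: the action slot is filled with Gaussian noise to be denoised.
        rng = random.Random(seed)
        noise = [[rng.gauss(0.0, 1.0) for _ in range(ACTION_DIM)]
                 for _ in range(ACTION_HORIZON)]
        return ModelInput("action_generation", obs_history, noise, True)

    # Mode II: a known action sequence is passed through as a clean condition.
    if len(known_actions) != ACTION_HORIZON or any(
            len(a) != ACTION_DIM for a in known_actions):
        raise ValueError("known_actions must have shape (ACTION_HORIZON, ACTION_DIM)")
    return ModelInput("future_imagination", obs_history, known_actions, False)


obs = [[0.0] * 4 for _ in range(CONTEXT_LEN)]   # 4 is a placeholder feature size
print(build_input(obs).mode)                     # no actions given
executed = [[0.1] * ACTION_DIM for _ in range(ACTION_HORIZON)]
print(build_input(obs, executed).mode)           # executed actions given
```

Keeping the two modes behind one input structure mirrors the captions' description of a single shared network whose behaviour is determined by whether the action token is noise or a known condition.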