DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation

Jiangran Lyu, Ziming Li, Xuesong Shi, Chaoyi Xu, Yizhou Wang, He Wang

TL;DR

DyWA addresses the challenge of generalizable non-prehensile manipulation under partial observability by jointly predicting future states and adapting to dynamics using historical trajectories. It unifies geometry, state, physics, and actions through a World Action Model conditioned by a Dynamics Adaptation module with FiLM, all within a teacher-student distillation framework that operates from single-view point clouds. The method uses a variable-impedance low-level controller to execute actions, and is trained with domain randomization to enable zero-shot sim-to-real transfer. In simulation, it outperforms baselines by 31.5% in success rate; in the real world, it achieves an average 68% success across diverse objects and friction conditions, demonstrating robust generalization and applicability to real tasks including natural language-guided goals when integrated with VLMs.
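The FiLM conditioning mentioned above can be illustrated with a minimal sketch. FiLM (feature-wise linear modulation) scales and shifts each feature channel using parameters computed from a conditioning vector, here the dynamics embedding. The layer shapes, the function names `linear` and `film`, and the toy weights are assumptions made for illustration only; they are not taken from the DyWA implementation.

```python
def linear(weights, bias, x):
    """Affine map; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features, dynamics_embedding, gamma_layer, beta_layer):
    """Feature-wise linear modulation: out_i = gamma_i * features_i + beta_i.

    gamma and beta are predicted from the dynamics embedding, so the same
    backbone features are rescaled differently for different dynamics.
    """
    gamma = linear(gamma_layer[0], gamma_layer[1], dynamics_embedding)
    beta = linear(beta_layer[0], beta_layer[1], dynamics_embedding)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


# Hypothetical toy weights (weights, bias), chosen only to make the arithmetic easy.
gamma_layer = ([[2.0, 0.0], [0.0, 1.0]], [0.0, 1.0])
beta_layer = ([[0.0, 0.0], [1.0, 0.0]], [0.5, 0.0])

features = [2.0, -1.0]            # stand-in for backbone features
dynamics_embedding = [0.5, 3.0]   # stand-in for the adaptation module output

modulated = film(features, dynamics_embedding, gamma_layer, beta_layer)
print(modulated)
```

In this toy case the embedding yields gamma = [1.0, 4.0] and beta = [0.5, 0.5], so the second feature channel is amplified four times more strongly than the first before the shift is applied.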

Abstract

Non-prehensile manipulation is crucial for handling objects that are too thin, large, or otherwise ungraspable in unstructured environments. While conventional planning-based approaches struggle with complex contact modeling, learning-based methods have recently emerged as a promising alternative. However, existing learning-based approaches face two major limitations: they rely heavily on multi-view cameras and precise pose tracking, and they fail to generalize across varying physical conditions, such as changes in object mass and table friction. To address these challenges, we propose the Dynamics-Adaptive World Action Model (DyWA), a novel framework that enhances action learning by jointly predicting future states while adapting to dynamics variations based on historical trajectories. By unifying the modeling of geometry, state, physics, and robot actions, DyWA enables more robust policy learning under partial observability. Compared to baselines, our method improves the success rate by 31.5% in simulation using only single-view point cloud observations. Furthermore, DyWA achieves an average success rate of 68% in real-world experiments, demonstrating its ability to generalize across diverse object geometries, adapt to varying table friction, and remain robust in challenging scenarios such as half-filled water bottles and slippery surfaces.

Paper Structure

This paper contains 39 sections, 13 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: Our World Action Model processes the embeddings of the current observation (partial point cloud, end-effector pose, and joint state) and the goal point cloud (transformed from the initial partial observation) to predict the robot action and next state. Additionally, an adaptation module encodes historical observations and actions, decoding them into the dynamics embedding that conditions the model via FiLM. A pre-trained RL teacher policy (right) supervises both the action and adaptation embedding using privileged full point cloud and physics parameter embeddings.
  • Figure 2: Loss curves during the distillation process. We adopt DAgger, which starts by executing the teacher's actions and gradually increases the weight of the student's actions, so that the initial loss declines rapidly. Left: Comparison of imitation loss between using only Dynamics Adaptation and incorporating the World Model. Right: Comparison of World Model loss between using only the World Model and integrating Dynamics Adaptation.
  • Figure 3: Qualitative results in the real world. The goal pose is shown transparently.
  • Figure 4: By integrating with Vision-Language Models (VLMs), our goal-conditioned policy can be executed based on natural language instructions.
  • Figure 5: Our policy helps grasp a thin card and a broad cracker box.
  • ...and 7 more figures
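
The DAgger-style schedule described in the Figure 2 caption, where execution starts from the teacher's action and the student's action is gradually weighted in, can be sketched as follows. This is a minimal illustration under stated assumptions: the linear ramp, the blending of actions by a scalar weight, the mean-squared imitation loss, and the names `student_weight`, `executed_action`, and `imitation_loss` are all hypothetical, and the paper's actual schedule and loss may differ.

```python
def student_weight(step, ramp_steps):
    """Assumed linear ramp from 0 (teacher only) to 1 (student only)."""
    return min(1.0, max(0.0, step / ramp_steps))


def executed_action(teacher_action, student_action, step, ramp_steps):
    """Blend the two actions; early rollouts follow the teacher closely."""
    w = student_weight(step, ramp_steps)
    return [(1.0 - w) * t + w * s
            for t, s in zip(teacher_action, student_action)]


def imitation_loss(teacher_action, student_action):
    """Mean squared error; the student is always supervised by the teacher."""
    diffs = [(t - s) ** 2 for t, s in zip(teacher_action, student_action)]
    return sum(diffs) / len(diffs)


teacher = [1.0, 0.0]   # hypothetical teacher action
student = [0.0, 1.0]   # hypothetical student action

for step in (0, 50, 200):
    print(step, executed_action(teacher, student, step, ramp_steps=100))
print("loss:", imitation_loss(teacher, student))
```

Whatever action is executed, the supervision target in this sketch remains the teacher's action, so the states visited shift toward the student's own distribution while the labels stay expert-quality.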