Pixel Motion Diffusion is What We Need for Robot Control

E-Ro Nguyen, Yichi Zhang, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo

TL;DR

This work presents DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via a structured pixel motion representation.

Abstract

We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://eronguyen.github.io/DAWN/

Paper Structure

This paper contains 27 sections, 1 equation, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Overview of DAWN with two major diffusion modules. Given visual observations, robot state, and a language instruction, a latent diffusion Motion Director predicts a dense pixel motion representation that describes the desired scene dynamics, which the diffusion policy Action Expert uses to generate executable robot actions. Explicit pixel motion provides a structured, interpretable interface between perception and control.
  • Figure 2: Comparison of Action-Prediction Frameworks. a) VLA directly maps observations and language instructions to action outputs. b) Future-RGB-frame prediction first generates future visual observations and then conditions the action policy on these predicted frames. c) Our proposed DAWN predicts pixel-motion representations via the Motion Director and converts them into actions using the Action Expert, enabling a more informative and structured intermediate representation for action prediction.
  • Figure 3: Architecture of Motion Director. The model encodes the static camera view and denoises it with a U-Net, conditioned on the gripper view, the language instruction, and a temporal offset. The output is decoded into predicted pixel motions, providing interpretable motion representations.
  • Figure 4: Architecture of Action Expert. The model encodes predicted pixel motion, visual observations, language instruction, and robot state into multimodal features. These inputs condition the denoising process, which iteratively refines noisy actions into executable robot trajectories.
  • Figure 5: Real-world environment examples. a) Our single-arm environment includes a robot arm and two cameras. They are stereo RGB cameras, but we only use one RGB view from each camera. b) The RGB image from the static camera. c) The RGB image from the gripper camera.
  • ...and 6 more figures
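
The two-stage data flow described in the captions of Figures 1, 3, and 4 can be sketched as below. This is a minimal toy illustration under stated assumptions, not the authors' implementation: the function names `toy_denoise`, `motion_director`, and `action_expert`, the 64x64 image size, the 7-dimensional robot state, the 8-step action horizon, and the simple refinement rule are all hypothetical. The learned networks are replaced by hand-written placeholders so that only the interface (pixel motion as the intermediate representation between the two diffusion modules) is shown.

```python
import numpy as np


def toy_denoise(shape, predict_clean, steps=10, seed=0):
    """Toy iterative refinement from Gaussian noise.

    Stand-in for a real diffusion sampler; it is NOT the sampler used in
    the paper. At each step the sample moves 1/t of the way toward the
    denoiser's estimate, so the final step (t == 1) lands on that estimate.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)
    for t in range(steps, 0, -1):
        x_hat = predict_clean(x, t)  # hypothetical denoiser output
        x = x + (x_hat - x) / t
    return x


def motion_director(static_rgb, gripper_rgb, instruction, offset):
    """High-level stage: returns a dense (H, W, 2) pixel motion field."""
    h, w, _ = static_rgb.shape

    def predict_clean(x, t):
        # Placeholder for a conditioned U-Net (assumption): uniform
        # rightward motion of `offset` pixels, ignoring the other inputs.
        target = np.zeros((h, w, 2))
        target[..., 0] = float(offset)
        return target

    return toy_denoise((h, w, 2), predict_clean)


def action_expert(pixel_motion, static_rgb, gripper_rgb, instruction,
                  robot_state, horizon=8):
    """Low-level stage: returns a (horizon, action_dim) action sequence."""
    action_dim = robot_state.shape[0]

    def predict_clean(x, t):
        # Placeholder for a learned policy (assumption): shift the first
        # two action dimensions by the mean predicted pixel motion.
        target = np.tile(robot_state, (horizon, 1)).astype(float)
        target[:, :2] += pixel_motion.mean(axis=(0, 1))
        return target

    return toy_denoise((horizon, action_dim), predict_clean, seed=1)


# Dummy inputs; sizes are illustrative assumptions.
static_rgb = np.zeros((64, 64, 3))
gripper_rgb = np.zeros((64, 64, 3))
robot_state = np.zeros(7)

flow = motion_director(static_rgb, gripper_rgb, "open the drawer", offset=4)
actions = action_expert(flow, static_rgb, gripper_rgb, "open the drawer",
                        robot_state)
print(flow.shape, actions.shape)
```

With these dummy inputs, `flow` has shape `(64, 64, 2)` and `actions` has shape `(8, 7)`. The point of the sketch is the interface: the second stage never sees a predicted RGB frame, only the motion field plus the current observations, instruction, and robot state.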