Diffusion Trajectory-guided Policy for Long-horizon Robot Manipulation

Shichao Fan, Quantao Yang, Yajie Liu, Kun Wu, Zhengping Che, Qingjie Liu, Min Wan

TL;DR

This work addresses the difficulty of generalizing imitation learning to long-horizon robot manipulation under data scarcity and error compounding. It introduces Diffusion Trajectory-guided Policy (DTP), a two-stage framework in which a Diffusion Trajectory Model generates task-relevant 2D trajectories from vision-language inputs, and those trajectories then guide a Transformer-based policy. The approach improves average success rate on the CALVIN benchmark by 25% over state-of-the-art baselines and demonstrates data-efficient learning and real-world viability. By providing trajectory-level guidance through a diffusion-based auxiliary model, DTP reduces error accumulation and improves transfer to unseen environments and longer task sequences, with practical implications for scalable, language-conditioned robotic manipulation.

Abstract

Recently, Vision-Language-Action (VLA) models have advanced robot imitation learning, but high data collection costs and limited demonstrations hinder generalization. Current imitation learning methods also struggle in out-of-distribution scenarios, especially for long-horizon tasks. A key challenge is mitigating compounding errors in imitation learning, which lead to cascading failures over extended trajectories. To address these challenges, we propose the Diffusion Trajectory-guided Policy (DTP) framework, which uses a diffusion model to generate 2D trajectories that guide policy learning for long-horizon tasks. By leveraging task-relevant trajectories, DTP provides trajectory-level guidance that reduces error accumulation. Our two-stage approach first trains a generative vision-language model to create diffusion-based trajectories, then refines the imitation policy using them. Experiments on the CALVIN benchmark show that DTP, trained from scratch without external pretraining, outperforms state-of-the-art baselines by 25% in success rate. Moreover, DTP significantly improves real-world robot performance.
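The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a standard DDPM-style reverse process for stage one and replaces both learned networks with hand-written stand-ins. The names `toy_noise_predictor`, `toy_policy`, `sample_trajectory`, `rollout`, and the `cond` dictionary are hypothetical. In the paper, the trajectory model is conditioned on vision and language, and the policy is a Transformer.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumed choice, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_noise_predictor(x_t, t, cond, alpha_bars):
    """Hypothetical stand-in for a learned network eps_theta(x_t, t, cond).

    It pretends the clean trajectory is a straight line from cond["start"]
    to cond["goal"] and returns the noise implied by that assumption.
    """
    w = np.linspace(0.0, 1.0, x_t.shape[0])[:, None]
    x0 = (1.0 - w) * cond["start"] + w * cond["goal"]
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])


def sample_trajectory(cond, horizon=16, num_steps=50, seed=0):
    """Stage 1 (sketch): denoise Gaussian noise into a (horizon, 2) trajectory."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal((horizon, 2))
    for t in range(num_steps - 1, -1, -1):
        eps = toy_noise_predictor(x, t, cond, alpha_bars)
        # Posterior mean of the DDPM reverse step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x


def toy_policy(position, trajectory, step):
    """Stage 2 (sketch): hypothetical policy that steps toward the next waypoint."""
    target = trajectory[min(step + 1, len(trajectory) - 1)]
    return target - position


def rollout(trajectory):
    """Follow the generated 2D trajectory with the toy policy."""
    position = trajectory[0].copy()
    for step in range(len(trajectory) - 1):
        position = position + toy_policy(position, trajectory, step)
    return position


cond = {"start": np.array([0.0, 0.0]), "goal": np.array([1.0, 2.0])}
trajectory = sample_trajectory(cond)
final_position = rollout(trajectory)
```

The sketch only shows the data flow: a conditioning input produces a full 2D trajectory up front, and the policy then consumes that trajectory at every step instead of relying solely on its own previous predictions.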

Paper Structure

This paper contains 15 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: System overview. a) and b) present a task instruction with the initial task observation, allowing our Diffusion Trajectory Model to predict the complete future 2D-particle trajectories; c) illustrates the Diffusion Trajectory-guided pipeline, showcasing how these predicted trajectories guide the manipulation policy.
  • Figure 2: System architecture for learning language-conditioned policies. a) shows the input modalities, including vision, language, and proprioception. b) describes the Diffusion Trajectory Model, detailing how vision and language inputs generate diffusion particle trajectories. c) explains how these trajectories guide the training of robot policies, focusing on the learning of the Diffusion Trajectory Policy. Masked learnable tokens represent the particle trajectory prediction token, action token, and video prediction token, respectively.
  • Figure 3: The upper four environments correspond to the CALVIN ABCD settings. The bottom section shows a sequence of five long-horizon tasks, each guided by a specific instruction.
  • Figure 4: Real-robot experiments: a) Franka performing five distinct manipulation tasks. b) Franka performing one long-horizon task composed of subtasks (A–B–C).
  • Figure 5: Diffusion Trajectory Visualization. The left half illustrates diffusion trajectory generation in the CALVIN environment, while the right half shows trajectory generation in a real-world robotic scenario.