Trace-Focused Diffusion Policy for Multi-Modal Action Disambiguation in Long-Horizon Robotic Manipulation

Yuxuan Hu, Xiangyu Chen, Chuhao Zhou, Yuxi Liu, Gen Li, Jindou Jia, Jianfei Yang

TL;DR

The paper proposes the Trace-Focused Diffusion Policy (TF-DP), a simple yet effective diffusion-based framework that explicitly conditions action generation on the robot's execution history, and shows that execution-trace conditioning offers a scalable and principled approach to robust long-horizon robotic manipulation within a single policy.

Abstract

Generative model-based policies have shown strong performance in imitation-based robotic manipulation by learning action distributions from demonstrations. However, in long-horizon tasks, visually similar observations often recur across execution stages while requiring distinct actions. Policies conditioned only on instantaneous observations then produce ambiguous predictions, a problem termed multi-modal action ambiguity (MA$^2$). To address this challenge, we propose the Trace-Focused Diffusion Policy (TF-DP), a simple yet effective diffusion-based framework that explicitly conditions action generation on the robot's execution history. TF-DP represents historical motion as an explicit execution trace and projects it into the visual observation space, providing stage-aware context when current observations alone are insufficient. In addition, the induced trace-focused field emphasizes task-relevant regions associated with historical motion, improving robustness to background visual disturbances. We evaluate TF-DP on real-world robotic manipulation tasks exhibiting pronounced multi-modal action ambiguity and visually cluttered conditions. Experimental results show that TF-DP improves temporal consistency and robustness, outperforming the vanilla diffusion policy by 80.56 percent on tasks with multi-modal action ambiguity and by 86.11 percent under visual disturbances, while maintaining inference efficiency with only a 6.4 percent runtime increase. These results demonstrate that execution-trace conditioning offers a scalable and principled approach for robust long-horizon robotic manipulation within a single policy.
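
To make the idea of projecting an execution trace into the visual observation space concrete, the following is a minimal sketch using a standard pinhole camera model. It is not the authors' implementation: the function names `project_point` and `project_trace`, the intrinsics matrix `K`, and the world-to-camera transform `T_cam_world` are all assumptions introduced for illustration, and lens distortion and skew are ignored.

```python
def project_point(p_world, K, T_cam_world):
    """Project one 3D world point to (u, v) pixels with a pinhole model.

    K is a 3x3 intrinsics matrix and T_cam_world a 4x4 world-to-camera
    transform, both given as nested lists (hypothetical inputs).
    Returns None if the point lies behind the camera.
    """
    x, y, z = p_world
    # Rigid transform into the camera frame using the top three rows of T.
    xc, yc, zc = (
        row[0] * x + row[1] * y + row[2] * z + row[3] for row in T_cam_world[:3]
    )
    if zc <= 1e-6:
        return None
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    # Perspective division followed by the intrinsics (skew is ignored).
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def project_trace(trace_world, K, T_cam_world):
    """Project a sequence of past end-effector positions, dropping invisible ones."""
    pixels = (project_point(p, K, T_cam_world) for p in trace_world)
    return [uv for uv in pixels if uv is not None]


if __name__ == "__main__":
    K = [[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]]
    identity = [[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 1.0, 0], [0, 0, 0, 1.0]]
    trace = [(0.0, 0.0, 1.0), (0.5, 0.0, 2.0), (0.0, 0.0, -1.0)]
    print(project_trace(trace, K, identity))
```

The resulting pixel coordinates could then be drawn onto the camera image as a polyline, so that two visually similar frames from different execution stages differ by the trace overlaid on them.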

Paper Structure (22 sections, 8 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Trace-Focused Diffusion Policy for Resolving Multi-modal Action Ambiguity (MA$^2$). (a) In long-horizon manipulation, visually similar observations map to different actions at different execution stages. (b) This one-to-many mapping causes MA$^2$ for diffusion policies conditioned only on instantaneous observations. (c) TF-DP resolves MA$^2$ by conditioning on explicit motion traces and a trace-focused field, enabling temporally consistent actions.
  • Figure 2: The framework of the proposed TF-DP. Historical robot motions are collected to form the motion trace, from which the proposed Trace-Focused Field is generated. The trace and the focused field are then projected into image space to resolve MA$^2$ and mitigate background visual disturbances.
  • Figure 3: The experimental setup and the evaluation tasks. The workspace uses a Franka Research 3 as the robotic manipulation platform, with one wrist camera, one side camera, and one top camera. Three evaluation tasks (Place cube, Press keyboard, and Pick & Place cubes from drawers) are selected because they all exhibit MA$^2$. The multicolored dashed circle marks a scene in which the robot receives the same observation but must choose different actions.
  • Figure 4: Execution trajectory comparisons under tasks with the MA$^2$ problem. We visualize the representative trajectories for the selected tasks (Place cube, Press keyboard, and Pick & Place cubes from drawers). From left to right: task setup with sequence information, ground-truth trajectory from human demonstrations, diffusion policy (DP), DP conditioned on past actions (DP-HistAct), TF-DP with only trace, and TF-DP. DP and DP-HistAct fail to follow the correct execution order, while TF-DP variants produce temporally consistent trajectories that match the demonstrated action sequences.
  • Figure 5: Visualization of input representations used by different methods. (a) DP, Diffusion Policy; (b) DP-HistAct, DP with action history conditioning; (c) TF-DP (trace), TF-DP with execution trace only; and (d) TF-DP, full TF-DP incorporating both execution traces and the trace-focused field.
  • ...and 3 more figures
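
The captions describe a trace-focused field that emphasizes image regions near the historical motion, but this listing does not give its formula. As an illustration only, one plausible form is a weight map that decays with distance from the nearest projected trace pixel. The Gaussian falloff, the `sigma` parameter, and the function name `trace_focus_field` below are assumptions, not the paper's definition.

```python
import math


def trace_focus_field(height, width, trace_pixels, sigma=10.0):
    """Return a height x width weight map in (0, 1], peaking on the trace.

    trace_pixels is a list of (u, v) pixel coordinates of the projected trace.
    Each weight is exp(-d^2 / (2 * sigma^2)), where d is the distance to the
    nearest trace pixel (an assumed Gaussian falloff, not the paper's formula).
    With an empty trace the map is uniformly 1.0, i.e. no region is emphasized.
    """
    if not trace_pixels:
        return [[1.0] * width for _ in range(height)]
    field = [[0.0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            # Squared distance from this pixel to the closest trace pixel.
            d2 = min((u - tu) ** 2 + (v - tv) ** 2 for tu, tv in trace_pixels)
            field[v][u] = math.exp(-d2 / (2.0 * sigma ** 2))
    return field


if __name__ == "__main__":
    field = trace_focus_field(5, 5, [(2, 2)], sigma=1.0)
    for row in field:
        print(" ".join(f"{w:.2f}" for w in row))
```

Multiplying an image by such a map would attenuate background pixels far from the trace, which is the intuition behind the robustness to visual disturbances described in the captions. This brute-force loop is for clarity; a distance transform would be the usual choice at real image sizes.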