Act to See, See to Act: Diffusion-Driven Perception-Action Interplay for Adaptive Policies

Jing Wang, Weiting Peng, Jing Tang, Zeyu Gong, Xihua Wang, Bo Tao, Li Cheng

TL;DR

DP-AG tackles the limitation of static perception in imitation learning by instituting a perception–action loop in which latent observations evolve under action-guided diffusion dynamics. It combines a variational latent representation, an action-driven latent SDE powered by a Vector–Jacobian Product, and a cycle-consistent InfoNCE loss to tightly couple perception and action during diffusion steps. The authors derive a variational lower bound for the action-guided SDE and prove that the contrastive objective enforces continuity in both latent and action trajectories. Empirically, DP-AG outperforms state-of-the-art methods on both simulation benchmarks and real UR5 manipulation tasks, achieving smoother trajectories, faster convergence, and higher task success under partial observability and dynamic conditions.
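
As a rough illustration of the variational latent representation mentioned above, the sketch below draws a reparameterized sample from a diagonal Gaussian posterior and computes its KL divergence to a prior. The standard-normal prior, the function names, and the example values are assumptions made for illustration; the source does not specify them.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized draw z = mu + sigma * xi, with xi ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions.

    The standard-normal prior is an assumption of this sketch.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu, log_var = [0.3, -0.5], [-1.0, 0.2]   # hypothetical encoder outputs
z = sample_latent(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

Writing the sample as a deterministic function of `mu`, `log_var`, and external noise is what lets gradients reach the encoder parameters in a variational setup.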

Abstract

Existing imitation learning methods decouple perception and action, overlooking the causal reciprocity between sensory representations and action execution that humans naturally leverage for adaptive behavior. To bridge this gap, we introduce Action-Guided Diffusion Policy (DP-AG), a unified representation learning framework that explicitly models the dynamic interplay between perception and action through probabilistic latent dynamics. DP-AG encodes latent observations into a Gaussian posterior via variational inference and evolves them using an action-guided SDE, where the Vector-Jacobian Product (VJP) of the diffusion policy's noise predictions serves as a structured stochastic force driving latent updates. To promote bidirectional learning between perception and action, we introduce a cycle-consistent contrastive loss that organizes the gradient flow of the noise predictor into a coherent perception-action loop, enforcing mutually consistent transitions in both latent updates and action refinements. Theoretically, we derive a variational lower bound for the action-guided SDE and prove that the contrastive objective enhances continuity in both latent and action trajectories. Empirically, DP-AG significantly outperforms state-of-the-art methods across simulation benchmarks and real-world UR5 manipulation tasks. DP-AG thus offers a promising step toward bridging biological adaptability and artificial policy learning.
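
To make the VJP-driven latent update more concrete, here is a minimal sketch of one Euler-Maruyama step in which the drift is a vector-Jacobian product of a noise predictor with respect to the latent. Everything specific is an assumption for illustration: the toy predictor `tanh(W z + U a)`, the use of the predicted noise itself as the cotangent vector, the sign of the drift, and the values of `dt` and `sigma`. The paper's actual network and SDE coefficients are not given in this summary.

```python
import math
import random


def noise_pred(W, U, z, a):
    """Toy noise predictor eps(z, a) = tanh(W z + U a); a stand-in only."""
    return [
        math.tanh(sum(w * zj for w, zj in zip(w_row, z))
                  + sum(u * aj for u, aj in zip(u_row, a)))
        for w_row, u_row in zip(W, U)
    ]


def vjp_wrt_latent(W, eps, v):
    """Return J^T v, where J = d eps / d z = diag(1 - eps^2) W for this model."""
    scaled = [(1.0 - e * e) * vi for e, vi in zip(eps, v)]
    return [sum(W[i][j] * scaled[i] for i in range(len(W)))
            for j in range(len(W[0]))]


def latent_step(W, U, z, a, dt=0.01, sigma=0.1, rng=None):
    """One Euler-Maruyama step of an assumed action-guided latent SDE."""
    if rng is None:
        rng = random.Random(0)
    eps = noise_pred(W, U, z, a)
    force = vjp_wrt_latent(W, eps, eps)  # assumption: cotangent is eps itself
    return [zj - dt * fj + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
            for zj, fj in zip(z, force)]


# Hypothetical sizes: 3 action dimensions, 2 latent dimensions.
W = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
U = [[0.3, 0.0, 0.1], [0.0, 0.2, -0.1], [0.1, 0.1, 0.4]]
z = [0.2, -0.1]
a = [0.5, -0.3, 0.1]
z_next = latent_step(W, U, z, a)
```

With this choice of cotangent, the VJP equals the gradient of `0.5 * ||eps||^2` with respect to `z`, so it can be computed with a single backward pass in an autodiff framework rather than by forming the full Jacobian.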

Paper Structure

This paper contains 45 sections, 4 theorems, 64 equations, 17 figures, 19 tables.

Key Result

Lemma 1

For unit-normalized vectors $\varepsilon^i_k$ and $\tilde{\varepsilon}^i_k$, and a temperature $\tau > 0$, if the InfoNCE loss satisfies $\mathcal{L}_{\text{cont}} \leq \alpha$ for some small constant $\alpha$, then for each $i \in \{1, \dots, B\}$, the similarity between corresponding pairs is boun
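
The lemma's setting (unit-normalized noise vectors, a temperature τ, and an InfoNCE loss over a batch of size B) can be illustrated with the standard InfoNCE form. The sketch below assumes same-index pairs are positives and all other batch entries are negatives; the paper's exact loss, and the bound it derives, may differ from this generic version. The function names and toy vectors are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(eps, eps_tilde, tau=0.1):
    """Standard InfoNCE over a batch; positives are the same-index pairs."""
    eps = [normalize(v) for v in eps]
    eps_tilde = [normalize(v) for v in eps_tilde]
    total = 0.0
    for i, e in enumerate(eps):
        # Cosine similarities to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(e, t)) / tau for t in eps_tilde]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(eps)


batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned_loss = info_nce(batch, batch)                      # pairs match
shuffled_loss = info_nce(batch, batch[1:] + batch[:1])     # pairs mismatched
```

In this toy batch the aligned pairing yields a loss near zero while the mismatched pairing yields a large one, which is the intuition behind the lemma's direction: a small loss is only achievable when corresponding pairs are highly similar.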

Figures (17)

  • Figure 1: Use of Observation Features. (a) Conventional methods map observation features directly to actions. (b) DP models action distributions through incremental denoising from white noise, conditioned on observation features. (c) DP-AG refines observation features via noise predictions, establishing a mutually reinforcing cycle between perception and action.
  • Figure 2: Method Overview. While Diffusion Policy (DP) generates actions from static observation features, our DP-AG establishes a dynamic perception–action loop by guiding feature evolution via the VJP of DP’s predicted noise. To reinforce interplay, a cycle-consistent contrastive loss aligns noise predictions from static and evolving features, enabling mutual perception–action influence.
  • Figure 3: Regression results on irregular spirals. Left: Trajectories and latent dynamics predicted by the Base Flow. Right: The VJP-Guided Flow continuously refines latents through output-guided corrections, which results in smoother and more coherent trajectories in both output and latent spaces.
  • Figure 4: Benchmark simulation environments include Robomimic, Franka Kitchen, Push-T, and Dynamic Push-T, with task and dataset details provided in the appendix.
  • Figure 5: Convergence Plots. Training action MSE over epochs on Push-T and Robomimic Can.
  • ...and 12 more figures

Theorems & Definitions (6)

  • Lemma 1: Noise Similarity Lower Bound
  • Theorem 1: Continuity Upper Bound
  • Lemma : Noise Similarity Lower Bound
  • proof
  • Theorem : Continuity Upper Bound
  • proof