Sample from What You See: Visuomotor Policy Learning via Diffusion Bridge with Observation-Embedded Stochastic Differential Equation

Zhaoyang Liu, Mokai Pan, Zhongyi Wang, Kaizhen Zhu, Haotao Lu, Jingya Wang, Ye Shi

TL;DR

BridgePolicy addresses the limitation that diffusion-based visuomotor policies treat observations only as conditioning signals. By embedding observations directly into the forward diffusion dynamics as a diffusion bridge, the method enables sampling from an observation-informed prior and improves perception-action coupling. The approach introduces a multi-modal fusion module and a semantic aligner to handle heterogeneous sensor inputs. It is validated on 52 simulation tasks across three benchmarks and five real-world tasks, where it outperforms state-of-the-art generative policies. The work advances diffusion-based imitation learning for robust multimodal robotic control and opens avenues for alternative alignment architectures.

Abstract

Imitation learning with diffusion models has advanced robotic control by capturing multi-modal action distributions. However, existing approaches typically treat observations as high-level conditioning inputs to the denoising network, rather than integrating them into the stochastic dynamics of the diffusion process itself. As a result, sampling must begin from random Gaussian noise, weakening the coupling between perception and control and often yielding suboptimal performance. We introduce BridgePolicy, a generative visuomotor policy that explicitly embeds observations within the stochastic differential equation via a diffusion-bridge formulation. By constructing an observation-informed trajectory, BridgePolicy enables sampling to start from a rich, informative prior rather than random noise, substantially improving precision and reliability in control. A key challenge is that classical diffusion bridges connect distributions with matched dimensionality, whereas robotic observations are heterogeneous and multi-modal and do not naturally align with the action space. To address this, we design a multi-modal fusion module and a semantic aligner that unify visual and state inputs and align observation and action representations, making the bridge applicable to heterogeneous robot data. Extensive experiments across 52 simulation tasks on three benchmarks and five real-world tasks demonstrate that BridgePolicy consistently outperforms state-of-the-art generative policies.
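
The idea of starting from an observation-informed prior rather than Gaussian noise can be illustrated with the simplest diffusion bridge, a Brownian bridge pinned at both ends. The sketch below is not the SDE used by BridgePolicy, which this summary does not specify. It assumes a Brownian bridge on t in [0, 1], and the function names (bridge_sample, linear_align), the noise scale sigma, and the random linear projection standing in for the semantic aligner are all hypothetical. It shows only two points from the abstract: the bridge endpoints must share a dimensionality, and the t = 1 endpoint is the aligned observation itself, so sampling can begin there.

```python
import math
import random


def linear_align(obs_feature, weights):
    """Hypothetical stand-in for an aligner: project an observation
    feature into the action dimension with a fixed linear map."""
    return [sum(w * o for w, o in zip(row, obs_feature)) for row in weights]


def bridge_sample(x0, xT, t, sigma=1.0, rng=random):
    """Draw x_t from a Brownian bridge pinned at x0 (t=0) and xT (t=1).

    Marginal: x_t = (1 - t) * x0 + t * xT + sigma * sqrt(t * (1 - t)) * eps.
    """
    if len(x0) != len(xT):
        raise ValueError("bridge endpoints must have the same dimensionality")
    std = sigma * math.sqrt(t * (1.0 - t))  # zero at both endpoints
    return [(1.0 - t) * a + t * b + std * rng.gauss(0.0, 1.0)
            for a, b in zip(x0, xT)]


rng = random.Random(0)
obs_dim, act_dim = 6, 2  # illustrative sizes, not taken from the paper

# Assumed toy data: a random projection, observation feature, and action.
weights = [[rng.gauss(0.0, 1.0) / math.sqrt(obs_dim) for _ in range(obs_dim)]
           for _ in range(act_dim)]
obs_feature = [rng.gauss(0.0, 1.0) for _ in range(obs_dim)]
action = [rng.gauss(0.0, 1.0) for _ in range(act_dim)]

# The aligned observation plays the role of the informative prior.
prior = linear_align(obs_feature, weights)

x_start = bridge_sample(action, prior, 0.0, rng=rng)  # equals the action
x_mid = bridge_sample(action, prior, 0.5, rng=rng)    # noisy interpolation
x_end = bridge_sample(action, prior, 1.0, rng=rng)    # equals the prior
```

Without the projection step, the 6-dimensional observation feature could not serve as an endpoint for a 2-dimensional action, which is the mismatch the abstract attributes to classical diffusion bridges.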

Paper Structure

This paper contains 26 sections, 8 equations, 13 figures, 17 tables, 2 algorithms.

Figures (13)

  • Figure 1: Comparison of Diffusion Policy and our BridgePolicy. Because BridgePolicy models observations within the diffusion process, its sampling can start from a rich and meaningful prior instead of the random noise used in standard Diffusion Policy.
  • Figure 2: Pipeline overview. BridgePolicy explicitly embeds observations into the diffusion SDE trajectory via a diffusion-bridge formulation. The observation consists of robot states and a point cloud. A multi-modal fusion module and a semantic aligner address the challenges of multi-modal distribution bridging and data shape mismatch, enabling effective exploitation of multi-modal observations. During inference, BridgePolicy samples from the observation and iteratively transforms it into the action through fast sampling.
  • Figure 3: Examples of the simulation tasks.
  • Figure 4: Examples of the real-world tasks.
  • Figure 5: Real-robot comparative visualization of DP3, FlowPolicy, and BridgePolicy at four critical waypoints for the Oven-Opening task.
  • ...and 8 more figures