Ada3Drift: Adaptive Training-Time Drifting for One-Step 3D Visuomotor Robotic Manipulation

Chongyang Xu, Yixian Zou, Ziliang Feng, Fanman Meng, Shuaicheng Liu

Abstract

Diffusion-based visuomotor policies effectively capture multimodal action distributions through iterative denoising, but their high inference latency limits real-time robotic control. Recent flow matching and consistency-based methods achieve single-step generation, yet sacrifice the ability to preserve distinct action modes, collapsing multimodal behaviors into averaged, often physically infeasible trajectories. We observe that the compute budget asymmetry in robotics (offline training vs. real-time inference) naturally motivates recovering this multimodal fidelity by shifting iterative refinement from inference time to training time. Building on this insight, we propose Ada3Drift, which learns a training-time drifting field that attracts predicted actions toward expert demonstration modes while repelling them from other generated samples, enabling high-fidelity single-step generation (1 NFE) from 3D point cloud observations. To handle the few-shot robotic regime, Ada3Drift further introduces a sigmoid-scheduled loss transition from coarse distribution learning to mode-sharpening refinement, and multi-scale field aggregation that captures action modes at varying spatial granularities. Experiments on three simulation benchmarks (Adroit, Meta-World, and RoboTwin) and real-world robotic manipulation tasks demonstrate that Ada3Drift achieves state-of-the-art performance while requiring $10\times$ fewer function evaluations than diffusion-based alternatives.

Paper Structure (15 sections, 9 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Single-step generation under multimodal action distributions. (a) Flow Matching: straight conditional paths collapse distinct modes into a single mode average, producing unsafe averaged actions. (b) Mean Flow: adaptive weighting adjusts paths but still converges to a biased mean between modes. (c) Adaptive Drifting (Ours): the drifting field $V(\hat{\mathbf{x}})$ steers predictions toward true demonstration modes during training via attraction and repulsion, enabling accurate multimodal recovery with 1 NFE at inference.
  • Figure 2: Mechanism of adaptive drifting. (a) Predictions $\hat{\mathbf{x}}$ are attracted toward expert modes $\mathbf{y}^+$ and repelled from each other via bidirectional affinity. Concentric contours show multi-temperature field aggregation ($\tau \!\in\! \{0.02, 0.05, 0.2\}$): small $\tau$ captures tight clusters, large $\tau$ covers broader structure. (b) Sigmoid-scheduled loss transitions from MSE-dominated coarse learning to drift-based mode sharpening at $0.7E$.
  • Figure 3: Overview of Ada3Drift. Left: the timestep-free 1D U-Net maps Gaussian noise $\mathbf{z}$ to action trajectories $\hat{\mathbf{x}}$. The encoder (blue) downsamples the temporal dimension; the decoder (teal) upsamples with skip connections. The bottom row (training only) shows the drifting field loss: displacement vectors aggregated across multiple temperatures and combined with MSE loss via sigmoid scheduling. Right: the 3D observation encoder extracts a global conditioning vector $\mathbf{g}$, which modulates every residual block via FiLM. At inference, only the U-Net forward pass is executed (1 NFE).
  • Figure 4: Training curves on Adroit dexterous manipulation tasks. Success rate (mean $\pm$ std over 3 seeds) versus training epoch. Ada3Drift consistently matches or outperforms other single-step methods (FlowPolicy, MP1) across all three tasks.
  • Figure 5: Qualitative comparison. Predicted action trajectories of FlowPolicy, MP1, and Ada3Drift on representative tasks. Ada3Drift generates trajectories that better align with the expert demonstrations, especially in multimodal scenarios.
  • ...and 1 more figures
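
The mechanism summarized in the Figure 2 caption (attraction toward expert modes, repulsion among predictions, aggregation over $\tau \in \{0.02, 0.05, 0.2\}$, and a sigmoid-scheduled switch around $0.7E$) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The softmax kernel on Euclidean distance, the plain averaging over temperatures, the `sharpness` constant, the convex blend of the two losses, and all function names are assumptions made for illustration. The captions do not specify these details, and the "bidirectional affinity" normalization is not reproduced here.

```python
import numpy as np

TEMPERATURES = (0.02, 0.05, 0.2)  # values quoted in the Figure 2 caption


def _softmax(logits, axis):
    # Numerically stable softmax along one axis.
    logits = logits - logits.max(axis=axis, keepdims=True)
    weights = np.exp(logits)
    return weights / weights.sum(axis=axis, keepdims=True)


def drift_field(pred, expert, temperatures=TEMPERATURES):
    """Hypothetical drifting field: pred is (N, D), expert is (M, D).

    Returns an (N, D) displacement that pulls each prediction toward
    nearby expert samples and pushes it away from other predictions.
    """
    n = pred.shape[0]
    d_pos = np.linalg.norm(pred[:, None, :] - expert[None, :, :], axis=-1)
    d_neg = np.linalg.norm(pred[:, None, :] - pred[None, :, :], axis=-1)
    d_neg[np.arange(n), np.arange(n)] = np.inf  # a sample does not repel itself

    field = np.zeros_like(pred, dtype=float)
    for tau in temperatures:
        # Attraction: kernel-weighted mean of expert samples minus the prediction.
        attract = _softmax(-d_pos / tau, axis=1) @ expert - pred
        field += attract
        if n > 1:
            # Repulsion: subtract the pull toward the other predictions.
            repel = _softmax(-d_neg / tau, axis=1) @ pred - pred
            field -= repel
    return field / len(temperatures)  # assumed: plain average over temperatures


def drift_weight(epoch, total_epochs, midpoint=0.7, sharpness=20.0):
    # Sigmoid schedule centred at midpoint * total_epochs (0.7E in the caption).
    # The sharpness value is an assumption, not taken from the paper.
    progress = epoch / total_epochs
    return 1.0 / (1.0 + np.exp(-sharpness * (progress - midpoint)))


def combined_loss(mse_loss, drift_loss, epoch, total_epochs):
    # Assumed convex blend: MSE dominates early, the drift term dominates late.
    lam = drift_weight(epoch, total_epochs)
    return (1.0 - lam) * mse_loss + lam * drift_loss
```

In a training loop, one plausible use is to regress each prediction onto the fixed target `pred + drift_field(pred, expert)` and to pass that regression error as `drift_loss`. Because the field is only needed to build training targets, inference would still be a single forward pass, consistent with the 1 NFE claim.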