
DiffusionAnything: End-to-End In-context Diffusion Learning for Unified Navigation and Pre-Grasp Motion

Iana Zhura, Yara Mahmoud, Jeffrin Sam, Hung Khang Nguyen, Didar Seyidov, Miguel Altamirano Cabrera, Dzmitry Tsetserukou

Abstract

Efficiently predicting motion plans directly from vision remains a fundamental challenge in robotics, where planning typically requires explicit goal specification and task-specific design. Recent vision-language-action (VLA) models infer actions directly from visual input, but they demand massive computational resources and extensive training data, and fail zero-shot in novel scenes. We present a unified image-space diffusion policy that handles both meter-scale navigation and centimeter-scale manipulation via multi-scale feature modulation, using only 5 minutes of self-supervised data per task. Three key innovations drive the framework: (1) multi-scale FiLM conditioning on task mode, depth scale, and spatial attention enables task-appropriate behavior in a single model; (2) trajectory-aligned depth prediction focuses metric 3D reasoning along generated waypoints; (3) self-supervised attention from AnyTraverse enables goal-directed inference without vision-language models or depth sensors. Operating purely from RGB input (2.0 GB memory, 10 Hz), the model achieves robust zero-shot generalization to novel scenes while remaining suitable for onboard deployment.
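To make the multi-scale FiLM conditioning concrete, the following is a minimal PyTorch sketch, assuming a context vector concatenated from the task mode, depth scale, and a pooled attention embedding; module names, dimensions, and the context layout are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of multi-scale FiLM conditioning (assumed structure, not the paper's code).
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """Modulates a feature map channel-wise from a context vector."""
    def __init__(self, ctx_dim: int, channels: int):
        super().__init__()
        # One linear layer predicts per-channel scale (gamma) and shift (beta).
        self.to_gamma_beta = nn.Linear(ctx_dim, 2 * channels)

    def forward(self, feats: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        gamma, beta = self.to_gamma_beta(ctx).chunk(2, dim=-1)
        # Broadcast (B, C) over spatial dimensions of (B, C, H, W).
        return gamma[..., None, None] * feats + beta[..., None, None]

class MultiScaleFiLM(nn.Module):
    """Applies FiLM at several encoder scales so one set of weights can
    switch between navigation-scale and manipulation-scale behavior."""
    def __init__(self, ctx_dim: int, channels_per_scale: list[int]):
        super().__init__()
        self.blocks = nn.ModuleList(
            FiLMBlock(ctx_dim, c) for c in channels_per_scale
        )

    def forward(self, multiscale_feats, ctx):
        return [blk(f, ctx) for blk, f in zip(self.blocks, multiscale_feats)]

# Hypothetical context: one-hot task mode, scalar depth scale, pooled attention embedding.
task_mode = torch.tensor([[1.0, 0.0, 0.0]])
depth_scale = torch.tensor([[0.1]])
attn_embed = torch.randn(1, 12)
ctx = torch.cat([task_mode, depth_scale, attn_embed], dim=-1)   # (1, 16)

feats = [torch.randn(1, 64, 32, 32), torch.randn(1, 128, 16, 16)]
modulated = MultiScaleFiLM(16, [64, 128])(feats, ctx)
```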

Paper Structure

This paper contains 23 sections, 10 equations, 7 figures, and 1 table.

Figures (7)

  • Figure 1: A single unified diffusion policy handles both centimeter-scale pre-grasp planning (left: object approach with $\pm$2 cm precision, cyan inset shows object attention) and meter-scale motion planning (right: hallway navigation with $\pm$10 cm obstacle avoidance, orange overlay shows traversable floor). Knowledge transfer between tasks is achieved through context-aware conditioning with shared model weights.
  • Figure 2: Context-aware cross-task diffusion policy architecture. Top (Supervision): RGB input is processed by a frozen visual encoder, modulated via FiLM conditioning using task-specific context (task mode, depth scale, spatial attention). The state-modulated decoder outputs traversability $\hat{\mathbf{T}}_t$ for supervision. Attention maps differ by task: navigation uses floor traversability (left heatmap), while manipulation focuses on target objects (right heatmap). Purple dashed lines indicate single-task paths; blue dashed lines show cross-task conditioning shared between operational modes. Bottom (Diffusion): The UNet performs iterative denoising conditioned on context $\mathbf{c}$ and noisy trajectory $\mathbf{x}_t + \mathbf{S}_g$, progressively refining predictions over $N$ steps (right: $\mathbf{x}_t \rightarrow \mathbf{x}_{t-1} \rightarrow \mathbf{x}_0$); a minimal sketch of this sampling loop follows the figure list.
  • Figure 3: Ground truth generation pipeline for start and goal prediction. For each sample: RGB input (64$\times$64), traversability map, object attention heatmap, ground-truth trajectory, and trajectory overlaid on the traversability map.
  • Figure 4: Qualitative results showing attention maps (top) and trajectories (bottom) across three tasks. (a) Exploration: Floor attention identifies traversable regions for obstacle-free navigation. (b) Navigation to Goal: Attention highlights target table while generating goal-directed trajectory. (c) Pre-Grasping: Object-centric attention enables centimeter-precise approach planning. Green circles: start; red stars: predicted goals. All results demonstrate zero-shot generalization on novel test scenes.
  • Figure 5: Task performance on novel test scenes. Navigation: 100% goal-reaching and collision-avoidance success. Pre-grasping: 70.6% goal success, 100% collision-free. Exploration: 100% obstacle avoidance. Zero-shot generalization across all modes.
  • ...and 2 more figures
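The bottom path of Figure 2 describes iterative denoising of a noisy trajectory conditioned on the context $\mathbf{c}$. The snippet below is a minimal DDPM-style ancestral sampling sketch of such a loop, assuming a noise-prediction UNet `eps_model(x_t, t, ctx)` and a linear beta schedule; the schedule, signatures, and step count are assumptions for illustration, not the paper's exact sampler.

```python
# Minimal DDPM-style sampling loop for conditional trajectory denoising
# (illustrative sketch; `eps_model` and the schedule are assumed, not the paper's).
import torch

@torch.no_grad()
def sample_trajectory(eps_model, ctx, shape, n_steps=50, device="cpu"):
    # Linear beta schedule and derived alpha terms.
    betas = torch.linspace(1e-4, 2e-2, n_steps, device=device)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x_t = torch.randn(shape, device=device)           # start from pure noise
    for t in reversed(range(n_steps)):
        # UNet predicts the noise component, conditioned on the context vector.
        eps = eps_model(x_t, torch.tensor([t], device=device), ctx)
        # Posterior mean of x_{t-1} given x_t and the predicted noise.
        mean = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + torch.sqrt(betas[t]) * noise
    return x_t                                         # denoised trajectory x_0
```

Here `ctx` would be the same conditioning signal used for FiLM modulation (task mode, depth scale, spatial attention), so a single sampler serves both navigation and pre-grasp modes.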