Visual-Geometry Diffusion Policy: Robust Generalization via Complementarity-Aware Multimodal Fusion

Yikai Tang, Haoran Geng, Sheng Zang, Pieter Abbeel, Jitendra Malik

TL;DR

The paper tackles the problem of robustness and generalization in visuomotor imitation learning under visual and spatial perturbations. It introduces Visual-Geometry Diffusion Policy (VGDP), a multimodal framework that uses a Complementarity-Aware Fusion Module with modality-wise dropout and a lightweight cross-attention layer to fuse RGB and 3D geometry. Through extensive sim-to-real experiments on the RoboVerse benchmark and four real-world tasks, VGDP achieves substantial improvements in average performance (39.1%), visual robustness (41.5%), and spatial generalization (15.2%), with ablations showing dropout is the key driver behind the expressive fused latent space. The results demonstrate strong zero-shot transfer and robustness to distribution shifts, illustrating the practical impact of principled multimodal fusion for reliable robotic imitation learning. Overall, VGDP advances data-efficient, generalizable visuomotor policies by enforcing balanced, complementary use of diverse sensory cues.

Abstract

Imitation learning has emerged as a crucial approach for acquiring visuomotor skills from demonstrations, where designing effective observation encoders is essential for policy generalization. However, existing methods often struggle to generalize under spatial and visual randomizations, instead tending to overfit. To address this challenge, we propose Visual-Geometry Diffusion Policy (VGDP), a multimodal imitation learning framework built around a Complementarity-Aware Fusion Module where modality-wise dropout enforces balanced use of RGB and point-cloud cues, with cross-attention serving only as a lightweight interaction layer. Our experiments show that the expressiveness of the fused latent space is largely induced by the enforced complementarity from modality-wise dropout, with cross-attention serving primarily as a lightweight interaction mechanism rather than the main source of robustness. Across a benchmark of 18 simulated tasks and 4 real-world tasks, VGDP outperforms seven baseline policies with an average performance improvement of 39.1%. More importantly, VGDP demonstrates strong robustness under visual and spatial perturbations, surpassing baselines with an average improvement of 41.5% in different visual conditions and 15.2% in different spatial settings.
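
The fusion idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`modality_dropout`, `cross_attention`, `fuse`), the token shapes, the drop probability, the single-head attention, the residual connection, and the mean pooling are all assumptions made for illustration. The sketch drops at most one modality per sample during training, then lets RGB tokens attend to point-cloud tokens.

```python
import numpy as np


def modality_dropout(rgb, pcd, p_drop, rng, training=True):
    """Zero out at most one modality per sample (hypothetical scheme).

    rgb: (B, N_rgb, D) tokens, pcd: (B, N_pcd, D) tokens.
    p_drop is the per-modality drop probability and must be <= 0.5
    so that the two drop events can be made mutually exclusive.
    """
    if not training or p_drop <= 0.0:
        return rgb, pcd
    assert p_drop <= 0.5, "both modalities must never be dropped together"
    u = rng.random(rgb.shape[0])
    # u in [0, p_drop) drops RGB; u in [p_drop, 2 * p_drop) drops the point cloud.
    keep_rgb = (u >= p_drop).astype(rgb.dtype)
    keep_pcd = ((u < p_drop) | (u >= 2 * p_drop)).astype(pcd.dtype)
    return rgb * keep_rgb[:, None, None], pcd * keep_pcd[:, None, None]


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention from query tokens to context tokens."""
    q = query @ w_q      # (B, N_q, D)
    k = context @ w_k    # (B, N_c, D)
    v = context @ w_v    # (B, N_c, D)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def fuse(rgb, pcd, params, p_drop=0.3, rng=None, training=True):
    """Return one fused feature vector per sample, shape (B, 2 * D)."""
    if rng is None:
        rng = np.random.default_rng()
    rgb, pcd = modality_dropout(rgb, pcd, p_drop, rng, training)
    w_q, w_k, w_v = params
    attended = rgb + cross_attention(rgb, pcd, w_q, w_k, w_v)  # residual connection
    # Pool tokens and concatenate both streams into the conditioning vector.
    return np.concatenate([attended.mean(axis=1), pcd.mean(axis=1)], axis=-1)
```

Because a sample can lose either stream entirely during training, a downstream policy conditioned on the output of `fuse` cannot rely on one modality alone, which is the balancing effect the abstract attributes to modality-wise dropout. Unlike standard element-wise dropout, this sketch does not rescale the surviving features.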

Paper Structure

This paper contains 21 sections, 4 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Visual-Geometry Diffusion Policy Overview. (a) Observation: The environment is captured by a single-view RGB-D camera together with robot joint states. (b) Perception: Each modality is independently encoded by a dedicated encoder, ensuring comprehensive semantic representation. (c) Integration: The richly represented features learn cross-modal dependencies and contextual relationships via a cross-attention layer, wrapped with modality-wise and element-wise dropout to enforce balanced modality utilization and feature activation. (d) Decision: Conditioned on the fused feature, a noised action is denoised to either provide loss for end-to-end training or output an action during evaluation.
  • Figure 2: Comparative Analysis of Encoder Performance. The per-task plot shows how different encoders perform across tasks, with each task score averaged over the three randomization levels. The per-randomization-level plot depicts their performance across randomization levels, where each level score is averaged over all tasks.
  • Figure 3: Environments under Different Randomization Levels. The three brackets, named L0, L1 and L2, demonstrate example randomization in materials, positions and viewpoints of the same task across different randomization levels.
  • Figure 4: Real-world Benchmarks. We deploy VGDP in the real world with a Franka Arm across four challenging tasks: cluttered-scene manipulation (PickButter), 6-DoF cereal pouring (PourCereal), fine-grained handling that requires generalization over a large spatial range (FetchBottle), and precision-driven, force-aware insertion (InsertPlug).
  • Figure 5: Evaluation Results for Real-world Randomization. The grid map represents a bird's-eye view of the workspace, where each grid cell corresponds to a bottle placement during evaluation. Policies are trained on expert demonstrations at the 30 dotted grid cells and evaluated on all 143 positions. Dark green, light green, and white represent IID success, OOD success, and failure, respectively.
  • ...and 4 more figures