Visual-Geometry Diffusion Policy: Robust Generalization via Complementarity-Aware Multimodal Fusion

Yikai Tang; Haoran Geng; Sheng Zang; Pieter Abbeel; Jitendra Malik

TL;DR

The paper tackles the problem of robustness and generalization in visuomotor imitation learning under visual and spatial perturbations. It introduces Visual-Geometry Diffusion Policy (VGDP), a multimodal framework that uses a Complementarity-Aware Fusion Module with modality-wise dropout and a lightweight cross-attention layer to fuse RGB and 3D geometry. Through extensive sim-to-real experiments on the RoboVerse benchmark and four real-world tasks, VGDP achieves substantial improvements in average performance (39.1%), visual robustness (41.5%), and spatial generalization (15.2%), with ablations showing dropout is the key driver behind the expressive fused latent space. The results demonstrate strong zero-shot transfer and robustness to distribution shifts, illustrating the practical impact of principled multimodal fusion for reliable robotic imitation learning. Overall, VGDP advances data-efficient, generalizable visuomotor policies by enforcing balanced, complementary use of diverse sensory cues.

Abstract

Imitation learning has emerged as a crucial approach for acquiring visuomotor skills from demonstrations, where designing effective observation encoders is essential for policy generalization. However, existing methods often struggle to generalize under spatial and visual randomizations and tend to overfit instead. To address this challenge, we propose Visual-Geometry Diffusion Policy (VGDP), a multimodal imitation learning framework built around a Complementarity-Aware Fusion Module, in which modality-wise dropout enforces balanced use of RGB and point-cloud cues and cross-attention acts as a lightweight interaction layer. Our experiments show that the expressiveness of the fused latent space is largely induced by the complementarity that modality-wise dropout enforces; cross-attention is not the main source of robustness. Across a benchmark of 18 simulated tasks and 4 real-world tasks, VGDP outperforms seven baseline policies with an average performance improvement of 39.1%. More importantly, VGDP demonstrates strong robustness under visual and spatial perturbations, surpassing baselines by an average of 41.5% across different visual conditions and 15.2% across different spatial settings.
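
To make the idea of modality-wise dropout concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the dropout scheme, so several details here are assumptions: the function names `modality_dropout` and `fuse`, the drop probability `p_drop`, dropping at most one modality per training sample, replacing the dropped features with zeros, and using plain concatenation as a stand-in for the cross-attention layer.

```python
import random


def modality_dropout(rgb_feat, pcd_feat, p_drop=0.3, rng=random):
    """Zero out at most one modality's feature vector for a training sample.

    Assumption (not from the paper): with probability p_drop exactly one
    modality is dropped, chosen uniformly; both are never dropped together.
    """
    if rng.random() < p_drop:
        if rng.random() < 0.5:
            rgb_feat = [0.0] * len(rgb_feat)   # drop the RGB branch
        else:
            pcd_feat = [0.0] * len(pcd_feat)   # drop the point-cloud branch
    return rgb_feat, pcd_feat


def fuse(rgb_feat, pcd_feat):
    """Hypothetical placeholder fusion: concatenate the two feature vectors.

    The paper describes a lightweight cross-attention layer at this step.
    """
    return rgb_feat + pcd_feat


if __name__ == "__main__":
    rng = random.Random(0)
    rgb, pcd = modality_dropout([0.2, 0.7], [1.0, 0.5, 0.1], p_drop=0.5, rng=rng)
    print(fuse(rgb, pcd))
```

Because the policy sometimes sees only one modality during training, it cannot rely on a single input stream, which is the balancing effect the abstract attributes to modality-wise dropout. At inference time the dropout step would be skipped and both feature vectors passed to the fusion layer.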
