Multi-Modal Manipulation via Multi-Modal Policy Consensus
Haonan Chen, Jiaming Xu, Hongyu Chen, Kaiwen Hong, Binghao Huang, Chaoqi Liu, Jiayuan Mao, Yunzhu Li, Yilun Du, Katherine Driggs-Campbell
TL;DR
The paper tackles robust multimodal robotic manipulation by addressing the brittleness of feature-level fusion when modalities are sparse or missing. It introduces a modular framework where modality-specific diffusion-based experts are combined via a learned router that assigns consensus weights, enabling incremental addition or removal of modalities without retraining. Empirical results on RLBench and real-world tasks demonstrate superior performance, robustness to perturbations and sensor failures, and context-dependent shifts in modality reliance (e.g., vision for geometry, touch for contact). The approach provides a principled, interpretable alternative to monolithic fusion and has practical implications for scalable, resilient multimodal robotics.
Abstract
Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental incorporation of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy also demonstrates robustness to physical perturbations and sensor corruption. Finally, we conduct a perturbation-based importance analysis, which reveals adaptive shifts between modalities.
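
The consensus step described above can be illustrated with a minimal sketch. It assumes that each modality-specific expert outputs a noise prediction of the same shape at a given denoising step, and that the router emits one scalar logit per modality which is normalized with a softmax. The softmax normalization, the function name `consensus_noise`, and the toy inputs are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def consensus_noise(expert_preds, router_logits):
    """Weighted combination of per-modality noise predictions.

    expert_preds:  dict mapping modality name -> noise prediction (np.ndarray)
    router_logits: dict mapping modality name -> scalar logit (assumed router output)
    Returns the combined prediction and the normalized weight per modality.
    """
    names = sorted(expert_preds)
    weights = softmax(np.array([router_logits[n] for n in names], dtype=float))
    combined = sum(w * expert_preds[n] for w, n in zip(weights, names))
    return combined, dict(zip(names, weights))


# Hypothetical predictions from a vision expert and a touch expert.
preds = {"vision": np.array([1.0, 0.0]), "touch": np.array([0.0, 1.0])}
logits = {"vision": 0.0, "touch": 0.0}

combined, weights = consensus_noise(preds, logits)

# Dropping a modality only shrinks the dictionaries; the remaining
# weights are renormalized, so the experts themselves stay untouched.
vision_only, vision_weights = consensus_noise(
    {"vision": preds["vision"]}, {"vision": logits["vision"]}
)
```

Because each expert contributes only through its weighted prediction, a modality can be added or removed in this sketch by changing the dictionary entries, which mirrors the modularity the abstract describes.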
