Multi-Modal Manipulation via Multi-Modal Policy Consensus

Haonan Chen, Jiaming Xu, Hongyu Chen, Kaiwen Hong, Binghao Huang, Chaoqi Liu, Jiayuan Mao, Yunzhu Li, Yilun Du, Katherine Driggs-Campbell

TL;DR

The paper tackles robust multimodal robotic manipulation by addressing the brittleness of feature-level fusion when modalities are sparse or missing. It introduces a modular framework where modality-specific diffusion-based experts are combined via a learned router that assigns consensus weights, enabling incremental addition or removal of modalities without retraining. Empirical results on RLBench and real-world tasks demonstrate superior performance, robustness to perturbations and sensor failures, and context-dependent shifts in modality reliance (e.g., vision for geometry, touch for contact). The approach provides a principled, interpretable alternative to monolithic fusion and has practical implications for scalable, resilient multimodal robotics.

Abstract

Effectively integrating diverse sensory modalities is crucial for robotic manipulation. However, the typical approach of feature concatenation is often suboptimal: dominant modalities such as vision can overwhelm sparse but critical signals like touch in contact-rich tasks, and monolithic architectures cannot flexibly incorporate new or missing modalities without retraining. Our method factorizes the policy into a set of diffusion models, each specialized for a single representation (e.g., vision or touch), and employs a router network that learns consensus weights to adaptively combine their contributions, enabling incremental addition of new representations. We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion, where it significantly outperforms feature-concatenation baselines on scenarios requiring multimodal reasoning. Our policy is also robust to physical perturbations and sensor corruption. Finally, a perturbation-based importance analysis reveals adaptive shifts between modalities.

Paper Structure

This paper contains 16 sections, 8 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Representation-Composable Policy. (a) Perturbation-based importance analysis in the occluded marker picking task shows that vision dominates early, while tactile signals become important once occluded, demonstrating that our framework dynamically utilizes different modalities across task phases. (b) Classical feature concatenation vs. our policy-level composition, where $m_i$ denotes a modality (e.g., RGB, point cloud, tactile, or learned visual feature). Our compositional design allows individual modality policies to be added or removed without retraining the entire network. (c) Our method unlocks key capabilities. These include Adaptive Sensing, retrieving an occluded marker using tactile feedback during occlusion; In-Hand Reorientation, reorienting a spoon within the gripper; Precise Manipulation, inserting a puzzle piece with fine-grained control; and Multi-Task Learning, consistently outperforming prior work across diverse tasks in RLBench.
  • Figure 2: Overview of Our Compositional Policy Framework. Raw sensory modalities ($m_{\text{rgb}}, m_{\text{tac}}$) are encoded into embeddings ($\mathbf{e}_{\text{rgb}}, \mathbf{e}_{\text{tac}}$). Each modality is factorized into complementary sub-policies (e.g., $\epsilon_{\theta_{\text{rgb,context}}}(\mathbf{e}_{\text{rgb}}, a)$, $\epsilon_{\theta_{\text{rgb,local}}}(\mathbf{e}_{\text{rgb}}, a)$, $\epsilon_{\theta_{\text{tac,coarse}}}(\mathbf{e}_{\text{tac}}, a)$, $\epsilon_{\theta_{\text{tac,fine}}}(\mathbf{e}_{\text{tac}}, a)$), which produce score predictions that are averaged into a modality-specific score. A router network $R_\psi(\mathbf{e}_{\text{rgb}}, \dots, \mathbf{e}_{\text{tac}})$ then predicts consensus weights $\{w_i\}$ to reconcile these modality-specific scores into the final composed score $\sum_i w_i \epsilon_i$, which defines the policy for action generation.
  • Figure 3: Real-World Experimental Setup. (a) UR5e manipulator equipped with dual cameras and tactile sensors. (b–d) Overlays of initial conditions for the evaluation tasks: occluded marker picking, spoon reorientation, and puzzle insertion.
  • Figure 4: Qualitative Policy Rollouts. Representative execution traces from three tasks: Task 1 occluded marker picking, where tactile feedback guides manipulation when vision is unavailable; Task 2 spoon reorientation, demonstrating dexterous in-hand manipulation; Task 3 puzzle insertion, requiring high-precision alignment at millimeter accuracy.
  • Figure 5: Typical Failure Cases of Baseline Methods. We show failure cases of an RGB-only policy compared with an RGB+Tactile concatenation baseline. Each task highlights the complementary roles of the two modalities: vision provides global spatial and geometric information, while tactile sensing provides contact awareness and fine-grained grasp feedback. (a) In occluded marker picking, the concatenation baseline becomes trapped without grasping, while RGB-only lacks awareness of the grasp state once occluded. (b) In spoon reorientation, the concatenation baseline fails at initial grasping, while RGB-only fails at precise placement. (c) In puzzle insertion, the concatenation baseline causes misalignment, while RGB-only suffers frequent grasp failures.
  • ...and 2 more figures
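
The composition described in the Figure 2 caption (sub-policy scores averaged per modality, then combined as $\sum_i w_i \epsilon_i$) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `modality_score` and `compose_scores`, the toy score vectors, and the choice to obtain the weights $w_i$ from a softmax over router logits are all assumptions made for illustration. In the paper, the scores come from diffusion sub-policies and the weights from the learned router $R_\psi$.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def modality_score(sub_scores):
    """Average the score predictions of one modality's sub-policies.

    sub_scores: list of arrays of identical shape, one per sub-policy
    (e.g. the 'context' and 'local' sub-policies of the RGB modality).
    """
    return np.mean(np.stack(sub_scores, axis=0), axis=0)


def compose_scores(modality_scores, router_logits):
    """Combine modality-specific scores as sum_i w_i * eps_i.

    modality_scores: dict mapping modality name -> score array.
    router_logits:   dict mapping modality name -> scalar logit.
    Assumption: weights are a softmax of the logits, so they sum to 1.
    """
    names = sorted(modality_scores)
    w = softmax(np.array([router_logits[n] for n in names], dtype=float))
    eps = np.stack([modality_scores[n] for n in names], axis=0)
    composed = np.tensordot(w, eps, axes=1)  # weighted sum over modalities
    return composed, dict(zip(names, w))


# Toy stand-ins for sub-policy outputs (hypothetical values).
eps_rgb = modality_score([np.array([1.0, 0.0]), np.array([3.0, 0.0])])
eps_tac = modality_score([np.array([0.0, 2.0]), np.array([0.0, 4.0])])

composed, weights = compose_scores(
    {"rgb": eps_rgb, "tac": eps_tac},
    {"rgb": 0.0, "tac": 0.0},  # equal logits give equal weights
)
print(composed, weights)
```

Because the combination happens at the level of per-modality scores rather than concatenated features, dropping a modality in this sketch only means omitting its entry from both dictionaries; the remaining weights are renormalized by the softmax. Whether the paper handles missing modalities in exactly this way is not stated in the text above.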