Modality-Composable Diffusion Policy via Inference-Time Distribution-level Composition
Jiahang Cao, Qiang Zhang, Hanzhong Guo, Jiaxu Wang, Hao Cheng, Renjing Xu
TL;DR
This work addresses the limitation of single-modality diffusion policies (DPs) in robotics by introducing the Modality-Composable Diffusion Policy (MCDP), which fuses the distribution-level outputs of pre-trained, modality-specific DPs at inference time without additional training. Building on compositional diffusion models (CDM), MCDP derives a score composition that does not rely on classifier-free guidance (CFG), in which the overall noise estimate is $\hat{\epsilon}_{\mathcal{M}^*}(\tau_t, t, {\bm{c}}) = \sum_{i=1}^{n} w_i \epsilon_\theta(\tau_t, t, {\bm{c}}_i)$ with $\sum_i w_i=1$, where ${\bm{c}}_i$ is the observation from the $i$-th modality and $w_i$ is its weight. This enables flexible integration of RGB and point-cloud modalities. Experiments on the RoboTwin dataset show that MCDP can improve adaptability and performance when both modalities are reasonably informative, and they reveal how the weight configuration steers the composed distribution toward the strengths of each modality. The findings point to practical pathways toward generalizable cross-modality, cross-domain, and even cross-embodiment policies, and the code is open-sourced to support adoption and extension.
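The weighted composition above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name compose_noise, the array shapes, and the two stand-in noise predictions (eps_rgb, eps_pcd) are assumptions made for illustration. In practice each prediction would come from a pre-trained modality-specific DP evaluated on the same noisy trajectory $\tau_t$ and timestep $t$.

```python
import numpy as np

def compose_noise(eps_preds, weights):
    """Return sum_i w_i * eps_i for per-modality noise predictions.

    eps_preds: list of n arrays with identical shape, one per modality.
    weights:   n scalars that must sum to 1.
    """
    eps = np.stack([np.asarray(e, dtype=float) for e in eps_preds])  # (n, ...)
    w = np.asarray(weights, dtype=float)
    if w.shape != (eps.shape[0],):
        raise ValueError("need exactly one weight per modality")
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    # Contract the modality axis: result has the shape of a single prediction.
    return np.tensordot(w, eps, axes=1)

# Toy usage with hypothetical shapes: horizon 8, action dimension 14.
rng = np.random.default_rng(0)
tau_t = rng.standard_normal((8, 14))  # noisy action trajectory at step t

# Stand-ins for the outputs of an RGB-based DP and a point-cloud-based DP.
eps_rgb = 0.9 * tau_t
eps_pcd = 1.1 * tau_t

# Composed estimate with w_rgb = 0.3 and w_pcd = 0.7.
eps_hat = compose_noise([eps_rgb, eps_pcd], [0.3, 0.7])
```

The composed estimate would replace a single policy's noise prediction in each denoising step. Setting one weight to 1 and the others to 0 recovers the corresponding single-modality policy.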
Abstract
Diffusion Policy (DP) has attracted significant attention as an effective method for policy representation due to its capacity to model multi-distribution dynamics. However, current DPs are often based on a single visual modality (e.g., RGB or point cloud), limiting their accuracy and generalization potential. Although training a generalized DP capable of handling heterogeneous multimodal data would enhance performance, it entails substantial computational and data-related costs. To address these challenges, we propose a novel policy composition method: by leveraging multiple pre-trained DPs based on individual visual modalities, we can combine their distributional scores to form a more expressive Modality-Composable Diffusion Policy (MCDP), without the need for additional training. Through extensive empirical experiments on the RoboTwin dataset, we demonstrate the potential of MCDP to improve both adaptability and performance. This exploration aims to provide valuable insights into the flexible composition of existing DPs, facilitating the development of generalizable cross-modality, cross-domain, and even cross-embodiment policies. Our code is open-sourced at https://github.com/AndyCao1125/MCDP.
