Towards Dexterous Embodied Manipulation via Deep Multi-Sensory Fusion and Sparse Expert Scaling

Yirui Sun; Guangyu Zhuge; Keliang Liu; Jie Gu; Zhihao Xia; Qionglin Ren; Chunxu Tian; Zhongxue Ga

TL;DR

DeMUSE is a Deep Multimodal Unified Sparse Experts framework that uses a Diffusion Transformer to integrate RGB, depth, and 6-axis force signals into a unified serialized stream, validating the necessity of deep multi-sensory integration for complex physical interactions.

Abstract

Dexterous embodied manipulation requires deep integration of heterogeneous multimodal sensory inputs. However, current vision-centric paradigms often overlook the force and geometric feedback that complex tasks depend on. This paper presents DeMUSE, a Deep Multimodal Unified Sparse Experts framework that uses a Diffusion Transformer to integrate RGB, depth, and 6-axis force signals into a unified serialized stream. Adaptive Modality-specific Normalization (AdaMN) recalibrates modality-aware features, mitigating representation imbalance and harmonizing the heterogeneous distributions of multi-sensory signals. To scale efficiently, the architecture uses a Sparse Mixture-of-Experts (MoE) with shared experts, increasing model capacity for physical priors while maintaining the low inference latency required for real-time control. A joint denoising objective synchronously synthesizes environmental evolution and action sequences to ensure physical consistency. With success rates of 83.2% and 72.5% in simulation and real-world trials, respectively, DeMUSE demonstrates state-of-the-art performance, validating the necessity of deep multi-sensory integration for complex physical interactions.
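
To make the "Sparse Mixture-of-Experts with shared experts" idea concrete, the following is a minimal sketch of top-k routing for a single token, not the DeMUSE implementation. The function names (`moe_forward`, `linear`, `softmax`), the use of plain linear maps as experts, the softmax router, and the choice of `top_k=2` with renormalized gate weights are all assumptions made for illustration; the abstract does not specify these details.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    # Matrix-vector product; `weights` is a list of rows.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def moe_forward(token, router_w, routed_experts, shared_expert, top_k=2):
    """Hypothetical sparse MoE layer applied to one token vector."""
    gate_probs = softmax(linear(router_w, token))
    # Select the top_k routed experts with the highest gate probability.
    chosen = sorted(range(len(gate_probs)),
                    key=lambda i: gate_probs[i], reverse=True)[:top_k]
    norm = sum(gate_probs[i] for i in chosen)
    # The shared expert processes every token regardless of routing.
    out = linear(shared_expert, token)
    # Only the chosen experts are evaluated, which keeps compute sparse.
    for i in chosen:
        weight = gate_probs[i] / norm  # renormalize over selected experts
        expert_out = linear(routed_experts[i], token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, chosen

# Toy example (assumed values): three routed experts that scale the input
# by 1, 2 and 3, plus an identity shared expert.
token = [1.0, 0.0]
router_w = [[2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
routed_experts = [[[c, 0.0], [0.0, c]] for c in (1.0, 2.0, 3.0)]
shared_expert = [[1.0, 0.0], [0.0, 1.0]]
out, chosen = moe_forward(token, router_w, routed_experts, shared_expert)
```

In this toy setup only two of the three routed experts run for the token, while the shared expert always contributes. This is the general mechanism by which a sparse MoE can add parameters without a proportional increase in per-token compute.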

Paper Structure (28 sections, 4 equations, 7 figures, 6 tables)

Introduction
Related Work
Generative Models and Multimodal Foundation Models.
Efficient Scaling with Sparse Experts.
Methodology
Problem Formulation
Model Architectures
Adaptive Modality-specific Normalization
Scaling with Mixture-of-Experts
Learning Objectives and Inference
Experiments
Datasets and Baselines
Results and Analysis
Ablation Studies
Conclusion
...and 13 more sections

Figures (7)

Figure 1: The top image shows the layout of the experimental platform and the sources of multimodal inputs. The bottom image shows the tasks involved in real robot operation.
Figure 2: The DeMUSE architecture for multi-modal embodied synthesis. Heterogeneous tokens are processed through a sequence of Transformer blocks. Each block employs AdaMN to ensure representational balance across modalities and a Sparse MoE layer to achieve high-capacity inference under real-time constraints. The framework synchronously generates future visual evolutions and continuous actions via a unified joint denoising stream.
Figure 3: Qualitative results on MetaWorld. Left: Basketball task; Right: Hammer task. Rows from top to bottom represent input Observations, generated Predictions, and actual Actions. High consistency across rows demonstrates the model's ability to capture physically consistent latent dynamics through joint denoising.
Figure 4: Z-axis force profile. Comparison of contact dynamics during the pressing task. DeMUSE achieves stable active compliance ($\sim$10 N) within 80 ms of contact ($t=80$ ms).
Figure 5: Terminal liquid level precision. Final liquid level distributions in the "Fill Cup" task. DeMUSE achieves high precision ($0.33 \pm 0.02$) with minimal variance.
...and 2 more figures
