Towards Dexterous Embodied Manipulation via Deep Multi-Sensory Fusion and Sparse Expert Scaling

Yirui Sun, Guangyu Zhuge, Keliang Liu, Jie Gu, Zhihao Xia, Qionglin Ren, Chunxu Tian, Zhongxue Ga

TL;DR

This paper presents DeMUSE, a Deep Multimodal Unified Sparse Experts framework that uses a Diffusion Transformer to integrate RGB, depth, and 6-axis force into a unified serialized stream. Its results support the necessity of deep multi-sensory integration for complex physical interactions.

Abstract

Dexterous embodied manipulation requires deep integration of heterogeneous multimodal sensory inputs. However, current vision-centric paradigms often overlook the force and geometric feedback that complex tasks depend on. This paper presents DeMUSE, a Deep Multimodal Unified Sparse Experts framework that uses a Diffusion Transformer to integrate RGB, depth, and 6-axis force into a unified serialized stream. Adaptive Modality-specific Normalization (AdaMN) recalibrates modality-aware features, mitigating representation imbalance and harmonizing the heterogeneous distributions of multi-sensory signals. To scale efficiently, the architecture uses a Sparse Mixture-of-Experts (MoE) with shared experts, which increases model capacity for physical priors while maintaining the low inference latency required for real-time control. A joint denoising objective synthesizes environmental evolution and action sequences synchronously to ensure physical consistency. With success rates of 83.2% in simulation and 72.5% in real-world trials, DeMUSE demonstrates state-of-the-art performance, validating the necessity of deep multi-sensory integration for complex physical interactions.
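
To make the "Sparse MoE with shared experts" idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a generic top-k router over a few routed experts plus one shared expert that processes every token. The class name `SparseMoE`, the sizes, and the use of single linear maps as experts are all hypothetical simplifications chosen for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def make_linear(rng, d_out, d_in, scale=0.1):
    """Random d_out x d_in weight matrix (stand-in for a trained layer)."""
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


class SparseMoE:
    """Hypothetical top-k MoE layer with one always-active shared expert."""

    def __init__(self, dim, num_experts=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.router = make_linear(rng, num_experts, dim)
        # Assumption: each expert is a single linear map, for brevity.
        self.experts = [make_linear(rng, dim, dim) for _ in range(num_experts)]
        self.shared = make_linear(rng, dim, dim)

    def route(self, x):
        """Pick the top-k experts for token x and renormalize their gates."""
        logits = matvec(self.router, x)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        chosen = order[: self.top_k]
        weights = softmax([logits[i] for i in chosen])
        return chosen, weights

    def forward(self, x):
        chosen, weights = self.route(x)
        out = matvec(self.shared, x)  # shared expert sees every token
        for i, w in zip(chosen, weights):
            y = matvec(self.experts[i], x)  # only k routed experts are evaluated
            out = [o + w * yi for o, yi in zip(out, y)]
        return out, chosen


moe = SparseMoE(dim=8, num_experts=4, top_k=2, seed=0)
token = [0.1 * i for i in range(8)]
output, active = moe.forward(token)
print(len(output), sorted(active))
```

Only `top_k` of the routed experts run per token, so per-token compute stays roughly constant as `num_experts` grows. This is the general mechanism by which sparse MoE layers add capacity without a proportional increase in latency.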

Paper Structure

This paper contains 28 sections, 4 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: The top image shows the layout of the experimental platform and the sources of multimodal inputs. The bottom image shows the tasks involved in real robot operation.
  • Figure 2: The DeMUSE architecture for multi-modal embodied synthesis. Heterogeneous tokens are processed through a sequence of Transformer blocks. Each block employs AdaMN to ensure representational balance across modalities and a Sparse MoE layer to achieve high-capacity inference under real-time constraints. The framework synchronously generates future visual evolutions and continuous actions via a unified joint denoising stream.
  • Figure 3: Qualitative results on MetaWorld. Left: Basketball task; Right: Hammer task. Rows from top to bottom show input Observations, generated Predictions, and actual Actions. High consistency across rows demonstrates the model's ability to capture physically consistent latent dynamics through joint denoising.
  • Figure 4: Z-axis force profile. Comparison of contact dynamics during the pressing task. DeMUSE achieves stable active compliance ($\sim$10 N) within 80 ms of contact ($t=80$ ms).
  • Figure 5: Terminal liquid level precision. Final liquid level distributions in the "Fill Cup" task. DeMUSE achieves high precision ($0.33 \pm 0.02$) with minimal variance.
  • ...and 2 more figures
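
The abstract and the Figure 2 caption describe AdaMN only at a high level, so the sketch below is an assumption-laden illustration of modality-specific normalization in general, not the paper's definition. It assumes each token is standardized and then rescaled with a scale and shift looked up by modality. The class name `ModalityNorm`, the modality keys, and the example token values are hypothetical.

```python
import math


def layer_norm(x, eps=1e-6):
    """Standardize one token to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class ModalityNorm:
    """Hypothetical per-modality normalization: shared standardization,
    followed by a scale and shift specific to the token's modality."""

    def __init__(self, dim, modalities):
        # Assumption: identity initialization; in a trained model these
        # parameters would be learned or predicted from conditioning.
        self.scale = {m: [1.0] * dim for m in modalities}
        self.shift = {m: [0.0] * dim for m in modalities}

    def __call__(self, token, modality):
        normed = layer_norm(token)
        gain, bias = self.scale[modality], self.shift[modality]
        return [g * v + b for g, v, b in zip(gain, normed, bias)]


def variance(x):
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


norm = ModalityNorm(dim=4, modalities=["rgb", "depth", "force"])
rgb_token = [0.0, 64.0, 128.0, 255.0]    # large-magnitude, pixel-like values
force_token = [0.01, -0.02, 0.03, 0.0]   # small-magnitude, sensor-like values

rgb_out = norm(rgb_token, "rgb")
force_out = norm(force_token, "force")
print(round(variance(rgb_out), 3), round(variance(force_out), 3))
```

After normalization, both tokens have comparable statistics despite raw magnitudes that differ by several orders. This is one plausible reading of "mitigating representation imbalance" across heterogeneous sensors.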