AC-DiT: Adaptive Coordination Diffusion Transformer for Mobile Manipulation

Sixiang Chen, Jiaming Liu, Siyuan Qian, Han Jiang, Lily Li, Renrui Zhang, Zhuoyang Liu, Chenyang Gu, Chengkai Hou, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

TL;DR

AC-DiT introduces an end-to-end diffusion-transformer framework that explicitly models the coordination between mobile bases and manipulators and adapts visual perception to the stage-specific needs of mobile manipulation. It combines a mobility-to-body conditioning mechanism with a perception-aware multimodal adaptation strategy to fuse 2D and 3D inputs under language guidance. The approach is validated on simulation benchmarks, including ManiSkill-HAB and RoboTwin, and on real-world tasks, achieving state-of-the-art performance and demonstrating robust cross-domain transfer. These contributions advance end-to-end mobile manipulation by addressing both kinematic coordination and perceptual modality selection in a unified model.

Abstract

Recently, mobile manipulation has attracted increasing attention for enabling language-conditioned robotic control in household tasks. However, existing methods still face challenges in coordinating the mobile base and the manipulator, primarily due to two limitations. On the one hand, they fail to explicitly model the influence of the mobile base on manipulator control, which easily leads to error accumulation under high degrees of freedom. On the other hand, they use the same visual observation modality (e.g., either all 2D or all 3D) throughout the entire mobile manipulation process, overlooking the distinct multimodal perception requirements at different stages. To address these limitations, we propose the Adaptive Coordination Diffusion Transformer (AC-DiT), which enhances mobile base and manipulator coordination for end-to-end mobile manipulation. First, since the motion of the mobile base directly influences the manipulator's actions, we introduce a mobility-to-body conditioning mechanism that guides the model to first extract base motion representations, which are then used as a context prior for predicting whole-body actions. This enables whole-body control that accounts for the potential impact of the mobile base's motion. Second, to meet the perception requirements at different stages of mobile manipulation, we design a perception-aware multimodal conditioning strategy that dynamically adjusts the fusion weights between various 2D visual images and 3D point clouds, yielding visual features tailored to the current perceptual needs. This allows the model to, for example, adaptively rely more on 2D inputs when semantic information is crucial for action prediction, while placing greater emphasis on 3D geometric information when precise spatial understanding is required. We validate AC-DiT through extensive experiments on both simulated and real-world mobile manipulation tasks.
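
The idea of dynamically adjusting fusion weights between 2D and 3D features can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not say how the weights are computed, so the sketch assumes a softmax over scaled dot-product similarities between each modality feature and a context vector. The names `fuse_modalities`, `image_2d`, `point_cloud_3d`, and `semantic_query` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_modalities(features, query):
    """Weight each modality feature by its similarity to `query`, then sum.

    features: dict mapping a modality name to a feature vector.
    query:    context vector of the same length (an assumption of this
              sketch, standing in for language- or state-derived context).
    Returns (weights per modality, fused feature vector).
    """
    names = list(features)
    scale = math.sqrt(len(query))
    weights = softmax([dot(features[n], query) / scale for n in names])
    fused = [0.0] * len(query)
    for w, n in zip(weights, names):
        for i, v in enumerate(features[n]):
            fused[i] += w * v
    return dict(zip(names, weights)), fused


# Toy features: the query aligns with the 2D feature, so the 2D modality
# should receive the larger fusion weight.
features = {
    "image_2d": [1.0, 0.0, 0.0, 0.0],
    "point_cloud_3d": [0.0, 1.0, 0.0, 0.0],
}
semantic_query = [2.0, 0.0, 0.0, 0.0]
weights, fused = fuse_modalities(features, semantic_query)
print(weights)
```

With a query aligned to the 3D feature instead, the weighting would shift toward `point_cloud_3d`, which mirrors the stage-dependent behaviour the abstract describes.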

Paper Structure

This paper contains 26 sections, 1 equation, 13 figures, 9 tables.

Figures (13)

  • Figure 1: Overview of AC-DiT. The proposed end-to-end mobile manipulation framework enhances the coordination between the mobile base and the manipulator by introducing two key mechanisms: mobility-to-body conditioning and perception-aware multimodal adaptation. The former enables action prediction that is conditioned on how upcoming mobile base movements may affect manipulator control, thereby reducing error accumulation. The latter constructs multimodal features tailored to the perception requirements at different stages of the mobile manipulation process. Under this paradigm, AC-DiT demonstrates superior performance in both simulation and real-world environments.
  • Figure 2: AC-DiT framework. We first train the modules in the grey-shaded region under the supervision of mobile base actions, allowing the lightweight mobility action head to learn to extract latent mobility features. After this, we optimize the entire AC-DiT model, enabling the mobile manipulation action head to predict both mobile base and manipulator actions. With the Mobility-to-Body Conditioning mechanism, this action head conditions on the extracted latent mobility features, allowing whole-body action prediction to account for the influence of mobile base motion. Meanwhile, the Perception-Aware Multimodal Adaptation mechanism enables this action head to adaptively assign different importance weights to various visual input features, resulting in a perception-aware visual condition tailored to the perception needs of different manipulation stages.
  • Figure 3: Robot execution visualization of 7 tasks in mobile simulator ManiSkill-HAB and 6 tasks in bimanual simulator RoboTwin.
  • Figure 4: Robot execution progress of AC-DiT in four real-world tasks.
  • Figure 4: Ablation study. 2D and 3D indicate whether images and point clouds, respectively, are used as input. MBC and MA denote the proposed Mobility-to-Body Conditioning and Perception-Aware Multimodal Adaptation mechanisms, respectively.
  • ...and 8 more figures
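
The two-stage flow described in the Figure 2 caption can be sketched roughly as follows. This is a toy illustration only: AC-DiT uses diffusion-transformer action heads, whereas the sketch swaps in untrained random linear layers to show the data flow alone. All dimensions, weight names, and function names (`mobility_latent`, `predict_base`, `predict_whole_body`) are hypothetical.

```python
import random


def linear(weights, x):
    """Apply a bias-free dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_weights(rows, cols, rng):
    """Random weight matrix; stands in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
OBS_DIM, LATENT_DIM, BASE_DIM, ARM_DIM = 6, 4, 3, 7  # assumed sizes

# Stage 1 (assumed): a lightweight mobility head, supervised with base
# actions only, learns to produce latent mobility features.
w_latent = make_weights(LATENT_DIM, OBS_DIM, rng)
w_base = make_weights(BASE_DIM, LATENT_DIM, rng)

# Stage 2 (assumed): the whole-body head sees the observation together
# with the latent mobility features and predicts base + arm actions.
w_body = make_weights(BASE_DIM + ARM_DIM, OBS_DIM + LATENT_DIM, rng)


def mobility_latent(obs):
    """Extract latent mobility features from an observation."""
    return linear(w_latent, obs)


def predict_base(obs):
    """Stage-1 output: base action decoded from the mobility latent."""
    return linear(w_base, mobility_latent(obs))


def predict_whole_body(obs, latent=None):
    """Stage-2 output: whole-body action conditioned on the latent."""
    if latent is None:
        latent = mobility_latent(obs)
    return linear(w_body, obs + latent)


obs = [0.1, -0.2, 0.3, 0.0, 0.5, -0.4]
action = predict_whole_body(obs)
# Zeroing the latent shows that the prediction depends on the conditioning.
action_no_mobility = predict_whole_body(obs, latent=[0.0] * LATENT_DIM)
print(len(action), action != action_no_mobility)
```

The point of the sketch is the dependency structure: the whole-body prediction takes the mobility latent as an extra input, so changing the expected base motion changes the predicted arm action.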