M4Diffuser: Multi-View Diffusion Policy with Manipulability-Aware Control for Robust Mobile Manipulation

Ju Dong, Lei Zhang, Liding Zhang, Yao Ling, Yu Fu, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

TL;DR

M4Diffuser is a hybrid framework for mobile manipulation that integrates a Multi-View Diffusion Policy with a novel Reduced and Manipulability-aware QP (ReM-QP) controller. It demonstrates robust performance, smooth whole-body coordination, and strong generalization to unseen tasks, paving the way for reliable mobile manipulation in unstructured environments.

Abstract

Mobile manipulation requires the coordinated control of a mobile base and a robotic arm while simultaneously perceiving both global scene context and fine-grained object details. Existing single-view approaches often fail in unstructured environments because of their limited field of view and their limited exploration and generalization abilities. Moreover, classical controllers, although stable, struggle with efficiency and manipulability near singularities. To address these challenges, we propose M4Diffuser, a hybrid framework that integrates a Multi-View Diffusion Policy with a novel Reduced and Manipulability-aware QP (ReM-QP) controller for mobile manipulation. The diffusion policy combines proprioceptive states with complementary camera perspectives, which capture both close-range object details and global scene context, to generate task-relevant end-effector goals in the world frame. These high-level goals are then executed by the ReM-QP controller, which eliminates slack variables for computational efficiency and incorporates manipulability-aware preferences for robustness near singularities. Comprehensive experiments in simulation and real-world environments show that M4Diffuser achieves 7 to 56 percent higher success rates and reduces collisions by 3 to 31 percent over baselines. Our approach demonstrates robust performance, smooth whole-body coordination, and strong generalization to unseen tasks, paving the way for reliable mobile manipulation in unstructured environments. Details of the demo and supplemental material are available on our project website: https://sites.google.com/view/m4diffuser.
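To make "manipulability near singularities" concrete, the sketch below computes one widely used conditioning measure, the inverse condition number (ratio of smallest to largest singular value) of a Jacobian. It is an illustration only: the planar two-link arm, the function names, and the link lengths are hypothetical and are not taken from the paper, and the abstract does not state how ReM-QP evaluates or weights manipulability.

```python
import math

def planar_2r_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian of a planar two-link arm (hypothetical toy model)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return ((-l1 * s1 - l2 * s12, -l2 * s12),
            (l1 * c1 + l2 * c12, l2 * c12))

def inverse_condition_number(J):
    """sigma_min / sigma_max of a 2x2 matrix, in [0, 1]; 0 means singular."""
    (a, b), (c, d) = J
    frob_sq = a * a + b * b + c * c + d * d
    det = a * d - b * c
    # Closed-form largest squared singular value of a 2x2 matrix.
    disc = math.sqrt(max(frob_sq * frob_sq - 4.0 * det * det, 0.0))
    sigma_max_sq = 0.5 * (frob_sq + disc)
    if sigma_max_sq == 0.0:
        return 0.0  # zero matrix: treat as fully singular
    # sigma_min * sigma_max = |det|, so sigma_min / sigma_max = |det| / sigma_max^2.
    return abs(det) / sigma_max_sq

# The measure shrinks toward 0 as the elbow straightens (q2 -> 0).
for q2_deg in (90, 30, 5, 0):
    J = planar_2r_jacobian(0.3, math.radians(q2_deg))
    print(f"elbow = {q2_deg:3d} deg  ICN = {inverse_condition_number(J):.4f}")
```

A controller that prefers configurations with a larger value of such a measure keeps the arm away from poses where small end-effector motions demand very large joint velocities. For a full whole-body Jacobian the same ratio would normally be obtained from a numerical SVD rather than the 2x2 closed form used here.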

Paper Structure

This paper contains 27 sections, 14 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: ${M}^{4}\text{Diffuser}$: Multi-View Diffusion Policy and ReM-QP controller for robust whole-body mobile manipulation.
  • Figure 2: Diffusion transformer policy architecture. Multi-view RGB observations and proprioceptive states are encoded into latent features, which condition a denoising diffusion process implemented with a Transformer. The policy outputs a desired end-effector goal in the world frame, which is converted into a twist and executed by the ReM-QP controller.
  • Figure 3: ReM-QP controller. The reduced QP formulation eliminates slack variables for faster optimization, while ICN-based preferences improve robustness near singularities. This ensures efficient and stable whole-body execution of the high-level policy outputs.
  • Figure 4: The DARKO robot platform. It is built on an omnidirectional mobile base (RB-Kairos) with mecanum wheels and is equipped with a Franka Emika Panda robotic arm (1) carrying a wrist-mounted RealSense D435i camera (2). Additional sensing equipment includes two Sick MicroScan 2D lidars (3), a front-facing Azure Kinect RGB-D camera (4), and an adjustable Azure Kinect RGB-D camera (5).
  • Figure 5: Task phases for the mobile manipulation benchmark in the MuJoCo simulator. The four phases consist of two navigation segments (Nav A, Nav B) and two dexterous manipulation segments (Desk A, Desk B).
  • ...and 4 more figures
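The Figure 2 caption states that the policy's world-frame end-effector goal is converted into a twist before it reaches the ReM-QP controller, without giving the conversion. One common way to do this is a proportional law on the pose error, sketched below. This is an assumption for illustration, not the authors' implementation: the function names, the (w, x, y, z) quaternion convention, and the `gain` parameter are all hypothetical.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def quat_to_rotvec(q):
    """Rotation vector (axis * angle) of a unit quaternion, shortest path."""
    w, x, y, z = q
    if w < 0.0:  # q and -q encode the same rotation; pick the short one
        w, x, y, z = -w, -x, -y, -z
    n = math.sqrt(x * x + y * y + z * z)
    if n < 1e-12:
        return (0.0, 0.0, 0.0)  # no measurable rotation error
    angle = 2.0 * math.atan2(n, w)
    return (x / n * angle, y / n * angle, z / n * angle)

def goal_to_twist(p_cur, q_cur, p_goal, q_goal, gain=1.0):
    """Proportional world-frame twist (vx, vy, vz, wx, wy, wz) toward a goal pose."""
    # Orientation error in the world frame: q_goal * q_cur^-1 (unit quaternions).
    q_cur_inv = (q_cur[0], -q_cur[1], -q_cur[2], -q_cur[3])
    rot_err = quat_to_rotvec(quat_mul(q_goal, q_cur_inv))
    lin = tuple(gain * (g - c) for g, c in zip(p_goal, p_cur))
    ang = tuple(gain * e for e in rot_err)
    return lin + ang

# Goal: 0.5 m ahead along x and rotated 90 degrees about z.
half = math.pi / 4
twist = goal_to_twist((0.0, 0.0, 0.0), (1.0, 0.0, 0.0, 0.0),
                      (0.5, 0.0, 0.0), (math.cos(half), 0.0, 0.0, math.sin(half)),
                      gain=2.0)
print(twist)
```

In a setup like this, the resulting six-vector would serve as the desired end-effector velocity that a whole-body QP controller tracks at each control step. In practice the twist is usually also clipped to velocity limits, which this sketch omits.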