MobRT: A Digital Twin-Based Framework for Scalable Learning in Mobile Manipulation

Yilin Mei, Peng Qiu, Wei Zhang, WenChao Zhang, Wenjie Song

TL;DR

MobRT presents a digital-twin framework to scale data generation for mobile manipulation, enabling coherent whole-body interactions with articulated objects and mobile-base tasks. It combines Virtual Kinematic Chains, whole-body motion planning, and a Transformer-based diffusion policy trained with Flow Matching, augmented by real-world demonstrations to improve sim-to-real transfer. A comprehensive MobRT benchmark validates data quality and reveals that additional generated trajectories consistently boost policy success, with the proposed method outperforming strong baselines, especially in data-scarce settings. Mixed data co-training further enhances real-world robustness, underscoring MobRT’s practical impact for mobile manipulation in unstructured environments.
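
The TL;DR names Flow Matching as the training objective, but this summary does not give its form. As a minimal sketch, assuming the common linear interpolation path between a noise sample and a demonstrated action (the paper may use a different path or time convention), the regression target could be computed as below. The function names `flow_matching_pair` and `mse` are illustrative and do not come from MobRT.

```python
from typing import List, Tuple

Vector = List[float]


def flow_matching_pair(noise: Vector, action: Vector, t: float) -> Tuple[Vector, Vector]:
    """Return (x_t, target_velocity) for a linear path from noise to action.

    Assumption (not stated in the source): x_t = (1 - t) * noise + t * action,
    so the velocity along the path is constant and equals action - noise.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(noise) != len(action):
        raise ValueError("noise and action must have the same length")
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    velocity = [a - n for n, a in zip(noise, action)]
    return x_t, velocity


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between a predicted and a target velocity."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


# Toy 2-D "action"; a real policy would use a whole-body action chunk.
x_t, v_target = flow_matching_pair(noise=[0.0, 0.0], action=[1.0, 2.0], t=0.5)
loss_if_perfect = mse(v_target, v_target)  # a perfect prediction gives zero loss
```

In training, a network conditioned on observations would predict the velocity at `(x_t, t)` and be regressed onto `v_target`; at inference, integrating the predicted velocity from `t = 0` to `t = 1` would turn noise into an action.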

Abstract

Recent advances in robotics have been largely driven by imitation learning, which depends critically on large-scale, high-quality demonstration data. However, collecting such data remains a significant challenge, particularly for mobile manipulators, which must coordinate base locomotion and arm manipulation in high-dimensional, dynamic, and partially observable environments. Consequently, most existing research remains focused on simpler tabletop scenarios, leaving mobile manipulation relatively underexplored. To bridge this gap, we present MobRT, a digital twin-based framework designed to simulate two primary categories of complex, whole-body tasks: interaction with articulated objects (e.g., opening doors and drawers) and mobile-base pick-and-place operations. MobRT autonomously generates diverse and realistic demonstrations through the integration of virtual kinematic control and whole-body motion planning, enabling coherent and physically consistent execution. We evaluate the quality of MobRT-generated data across multiple baseline algorithms, establishing a comprehensive benchmark and demonstrating a strong correlation between task success and the number of generated trajectories. Experiments integrating both simulated and real-world demonstrations confirm that our approach markedly improves policy generalization and performance, achieving robust results in both simulated and real-world environments.

Paper Structure

This paper contains 18 sections, 4 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: FewShot Sim2Real with MobRT. By leveraging 300 simulated demonstrations and only 20 real ones, mobile manipulators learn articulated-object interaction and mobile-base pick-and-place tasks with successful sim-to-real transfer.
  • Figure 2: Overview of the MobRT pipeline. The framework integrates three key components: (1) Demonstration generation in simulation, where asset annotation, manipulation actions generation, and whole-body motion planning enable large-scale, low-cost data collection; (2) Adaptation with real-world demonstrations, where a small set of trajectories captures real dynamics and sensor noise to complement the simulated data; and (3) Policy training with hybrid data, where multi-modal encoders process visual and proprioceptive inputs and a transformer-based diffusion policy learns to coordinate mobile-base and arm motions for whole-body tasks.
  • Figure 3: Generating simulation demonstrations. (a) Functional-axis alignment to synthesize pick-and-place actions; (b) VKC to generate articulated-object manipulation; (c) whole-body planning vs. separate planning for coordinated base–arm motion; (d) randomized environment reset for diverse, validated data.
  • Figure 4: Real-world experimental environment and illustration of our robot platform. To test the robustness of the system, the drawers are placed at variable heights.
  • Figure 5: Task Execution of MobRT in Simulation. Representative tasks, including drawer and dishwasher opening and object placement, demonstrate MobRT’s whole-body coordination and sequential manipulation.
  • ...and 2 more figures
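
The captions for Figures 1 and 2 describe training on hybrid data, with 300 simulated demonstrations and 20 real ones. This summary does not say how the two sources are mixed, so the following is only a sketch of one common option: drawing a fixed fraction of every batch from the real set. The function `sample_mixed_batch` and the `real_fraction` parameter are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Tuple


def sample_mixed_batch(
    num_sim: int,
    num_real: int,
    batch_size: int,
    real_fraction: float,
    rng: random.Random,
) -> List[Tuple[str, int]]:
    """Sample (source, demo_index) pairs with a fixed share of real demos.

    Sampling is with replacement, so a small real set can be oversampled.
    `real_fraction` is an illustrative knob, not a value from the source.
    """
    if num_sim <= 0 or num_real <= 0:
        raise ValueError("both datasets must be non-empty")
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must lie in [0, 1]")
    n_real = round(batch_size * real_fraction)
    n_sim = batch_size - n_real
    batch = [("real", rng.randrange(num_real)) for _ in range(n_real)]
    batch += [("sim", rng.randrange(num_sim)) for _ in range(n_sim)]
    rng.shuffle(batch)  # avoid a fixed real-then-sim ordering inside the batch
    return batch


# Dataset sizes from the Figure 1 caption; batch size and fraction are made up.
batch = sample_mixed_batch(
    num_sim=300, num_real=20, batch_size=8, real_fraction=0.25, rng=random.Random(0)
)
```

Pooling the two sets without any weighting would give real demonstrations only 20 of 320 samples, about 6%; a per-batch fraction makes that share explicit and tunable.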