RoboMatch: A Unified Mobile-Manipulation Teleoperation Platform with Auto-Matching Network Architecture for Long-Horizon Tasks

Hanyu Liu, Yunsheng Ma, Jiaxin Huang, Keqiang Ren, Jiayi Wen, Yilin Zheng, Haoru Luan, Baishu Wan, Pan Li, Jiejun Hou, Zhihua Wang, Zhigong Song

Abstract

This paper presents RoboMatch, a novel unified teleoperation platform for mobile manipulation with an auto-matching network architecture, designed to tackle long-horizon tasks in dynamic environments. Our system enhances teleoperation performance, data collection efficiency, task accuracy, and operational stability. The core of RoboMatch is a cockpit-style control interface that enables synchronous operation of the mobile base and dual arms, significantly improving control precision and data collection. Moreover, we introduce the Proprioceptive-Visual Enhanced Diffusion Policy (PVE-DP), which leverages Discrete Wavelet Transform (DWT) for multi-scale visual feature extraction and integrates high-precision IMUs at the end-effector to enrich proprioceptive feedback, substantially boosting fine manipulation performance. Furthermore, we propose an Auto-Matching Network (AMN) architecture that decomposes long-horizon tasks into logical sequences and dynamically assigns lightweight pre-trained models for distributed inference. Experimental results demonstrate that our approach improves data collection efficiency by over 20%, increases task success rates by 20-30% with PVE-DP, and enhances long-horizon inference performance by approximately 40% with AMN, offering a robust solution for complex manipulation tasks. Project website: https://robomatch.github.io
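
The abstract states that PVE-DP uses the Discrete Wavelet Transform for multi-scale visual feature extraction. As a rough illustration of what "multi-scale" means in that setting, the sketch below applies a two-level 2D Haar decomposition to a toy image in pure Python. The choice of the Haar wavelet, the number of levels, and all function names (`haar_1d`, `haar_2d`, `multiscale_haar`) are assumptions made for illustration; the source does not specify which wavelet or decomposition depth PVE-DP uses, and this is not the authors' implementation.

```python
import math
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def haar_1d(signal: List[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = 1.0 / math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def _column_pass(m: Matrix) -> Tuple[Matrix, Matrix]:
    """Apply haar_1d down every column and return (low, high) matrices."""
    low_cols, high_cols = [], []
    for col in zip(*m):
        a, d = haar_1d(list(col))
        low_cols.append(a)
        high_cols.append(d)
    # Transpose back so each result is again a list of rows.
    return [list(r) for r in zip(*low_cols)], [list(r) for r in zip(*high_cols)]


def haar_2d(image: Matrix) -> Tuple[Matrix, Matrix, Matrix, Matrix]:
    """One 2D level: filter along rows first, then along columns."""
    low_rows, high_rows = [], []
    for row in image:
        a, d = haar_1d(row)
        low_rows.append(a)
        high_rows.append(d)
    low_low, low_high = _column_pass(low_rows)     # row low-pass, column low/high
    high_low, high_high = _column_pass(high_rows)  # row high-pass, column low/high
    return low_low, low_high, high_low, high_high


def multiscale_haar(image: Matrix, levels: int) -> Tuple[Matrix, List[Dict[str, Matrix]]]:
    """Repeatedly decompose the approximation band; returns the coarsest
    approximation plus the detail bands of every level (finest first)."""
    pyramid: List[Dict[str, Matrix]] = []
    current = image
    for _ in range(levels):
        ll, lh, hl, hh = haar_2d(current)
        pyramid.append({"low_high": lh, "high_low": hl, "high_high": hh})
        current = ll
    return current, pyramid


# Toy 4x4 "image" with constant intensity (hypothetical input).
toy_image = [[1.0] * 4 for _ in range(4)]
coarse, details = multiscale_haar(toy_image, levels=2)
```

For a constant image, all detail bands are zero and the energy concentrates in the coarsest approximation, which is the behaviour a frequency-aware encoder can exploit: edges and textures show up in the detail bands at the scale where they occur. In practice, a library such as PyWavelets would typically replace these hand-written loops.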

Paper Structure

This paper contains 14 sections, 7 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of the RoboMatch framework. The framework integrates three core components: (1) RoboMatch, a unified mobile-manipulation teleoperation platform that matches the robot base with the operation platform to achieve high-precision master-slave control and immersive observation; (2) PVE-DP, a policy that combines spatio-frequency visual enhancement with rich proprioception to improve fine manipulation accuracy; (3) AMN, an architecture that pairs the semantic parsing capabilities of VLMs with the execution advantages of specialized small policy networks for long-horizon execution, enabling chain-of-thought reasoning for complex task decomposition and adaptive operation.
  • Figure 2: RoboMatch teleoperation demonstration.
  • Figure 3: Overall network architecture of PVE-DP, which consists of three core modules: (1) the FE-EMA Module enhances robotic visual perception by integrating spatio-frequency features from visual data; (2) Rich Proprioception strengthens the robot's self-state awareness by concatenating arm joint positions (Qpos) with the end-effector quaternion (Quat) obtained from IMU sensing; (3) the DP U-Net takes the enhanced visual-proprioceptive observations as conditions and predicts fine-grained action noise to achieve precise manipulation during robot inference.
  • Figure 4: Overview of the AMN framework. This figure presents the Auto-Matching Network architecture, featuring: (1) Task Decomposition of a complex task $\mathcal{T}$ into sub-tasks for specialized policy networks; (2) Vision-Language Input, which processes both linguistic commands and visual scenes; (3) Chain-of-Thought Thinking and Planning, which breaks tasks into sequential steps matched with pre-trained policies; (4) Inference Execution using PVE-DP with wrist quaternion and visual spatio-frequency fusion for precise manipulation.
  • Figure 5: Comparison between DP and the proposed PVE-DP method in simulation and real-world tasks. The first row demonstrates the inference results of DP, while the second row presents the improved performance of PVE-DP.
  • ...and 1 more figure
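
The Figure 4 caption describes AMN as decomposing a complex task into sequential sub-tasks and matching each one with a pre-trained policy. A minimal sketch of that matching step is given below, assuming a simple dictionary lookup keyed by skill name. Everything here is hypothetical: the `SubTask` fields, the registry, the stub policies, and the hard-coded `decompose` function, which stands in for the VLM-based chain-of-thought planner that the source mentions but does not detail.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SubTask:
    skill: str   # hypothetical skill label, e.g. "pick"
    target: str  # hypothetical object or location the skill acts on


# A policy is modelled as a callable that executes one sub-task and
# returns a short log string. Real policies would emit robot actions.
Policy = Callable[[SubTask], str]


def make_stub_policy(name: str) -> Policy:
    """Build a placeholder policy that only reports what it would do."""
    def run(step: SubTask) -> str:
        return f"{name}({step.target})"
    return run


def decompose(instruction: str) -> List[SubTask]:
    """Stand-in for the planner: returns a fixed plan for one instruction.
    The source uses a VLM here; this lookup table is an assumption."""
    plans = {
        "put the cup on the tray": [
            SubTask("navigate", "table"),
            SubTask("pick", "cup"),
            SubTask("place", "tray"),
        ],
    }
    if instruction not in plans:
        raise ValueError(f"no plan available for: {instruction!r}")
    return plans[instruction]


def match_and_execute(plan: List[SubTask], registry: Dict[str, Policy]) -> List[str]:
    """Match every sub-task to a registered policy, then run them in order."""
    matched = []
    for step in plan:
        policy = registry.get(step.skill)
        if policy is None:
            # Fail before executing anything if a skill has no policy.
            raise KeyError(f"no policy registered for skill {step.skill!r}")
        matched.append((policy, step))
    return [policy(step) for policy, step in matched]


registry: Dict[str, Policy] = {
    "navigate": make_stub_policy("navigate_policy"),
    "pick": make_stub_policy("pick_policy"),
    "place": make_stub_policy("place_policy"),
}

log = match_and_execute(decompose("put the cup on the tray"), registry)
```

Matching the whole plan before running any step is a design choice of this sketch: it avoids starting a long-horizon task that cannot be completed because one sub-task has no corresponding policy.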