SuperSuit: An Isomorphic Bimodal Interface for Scalable Mobile Manipulation

Tongqing Chen, Hang Wu, Jiasen Wang, Xiaotao Li, Zhu Jin, Lu Fang

TL;DR

Results indicate that consistent kinematic representations across collection modalities enable scalable data acquisition for long-horizon mobile manipulation and, by extension, for embodied AI.

Abstract

High-quality, long-horizon demonstrations are essential for embodied AI, yet acquiring such data for tightly coupled wheeled mobile manipulators remains a fundamental bottleneck. Unlike fixed-base systems, mobile manipulators require continuous coordination between $SE(2)$ locomotion and precise manipulation, exposing limitations in existing teleoperation and wearable interfaces. We present SuperSuit, a bimodal data acquisition framework that supports both robot-in-the-loop teleoperation and active demonstration under a shared kinematic interface. Both modalities produce structurally identical joint-space trajectories, enabling direct data mixing without modifying downstream policies. For locomotion, SuperSuit maps natural human stepping to continuous planar base velocities, eliminating discrete command switches. For manipulation, it employs a strictly isomorphic wearable arm in both modes, while policy training is formulated in a shift-invariant delta-joint representation to mitigate calibration offsets and structural compliance without inverse kinematics. Real-world experiments on long-horizon mobile manipulation tasks show $2.6\times$ higher demonstration throughput in active mode compared to a teleoperation baseline, comparable policy performance when substituting teleoperation data with active demonstrations at fixed dataset size, and monotonic performance improvement as active data volume increases. These results indicate that consistent kinematic representations across collection modalities enable scalable data acquisition for long-horizon mobile manipulation.
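
To make the shift-invariance claim concrete, the following is a minimal sketch, assuming the delta-joint representation is the difference between consecutive joint-position vectors (the abstract does not spell out the exact formulation, so this is an assumption). The function name `delta_joint_actions`, the two-joint trajectory, and the offset values are hypothetical and chosen only for illustration. The sketch shows that a constant per-joint calibration offset cancels in the deltas.

```python
from typing import List, Sequence


def delta_joint_actions(trajectory: Sequence[Sequence[float]]) -> List[List[float]]:
    """Convert absolute joint positions q_0..q_T into per-step deltas q_{t+1} - q_t.

    Assumption: "delta-joint" means consecutive-step differences in joint space.
    """
    return [
        [q_next_i - q_t_i for q_t_i, q_next_i in zip(q_t, q_next)]
        for q_t, q_next in zip(trajectory, trajectory[1:])
    ]


# Hypothetical 2-joint trajectory in radians (illustrative values only).
clean = [[0.0, 0.5], [0.25, 0.5], [0.5, 0.75]]

# Hypothetical constant calibration offset added to every sample of each joint.
offset = [0.125, -0.25]
shifted = [[q + o for q, o in zip(q_t, offset)] for q_t in clean]

deltas_clean = delta_joint_actions(clean)
deltas_shifted = delta_joint_actions(shifted)

# The constant offset cancels in the differences, so both delta sequences match.
offset_cancels = all(
    abs(a - b) < 1e-12
    for row_a, row_b in zip(deltas_clean, deltas_shifted)
    for a, b in zip(row_a, row_b)
)
```

Under this assumption, a policy trained on deltas sees the same targets whether or not the wearable arm and the robot agree on an absolute zero, which is consistent with the abstract's point that no inverse kinematics or per-session recalibration is needed. Time-varying errors such as structural compliance would not cancel exactly; the abstract only claims they are mitigated.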

Paper Structure (17 sections, 7 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: The SuperSuit Framework. Our untethered wearable interface translates human embodiment into whole-body robot control via strict isomorphic arm manipulation and zero-drift base locomotion. This synergistic architecture natively supports bimodal data acquisition (teleoperation and active collection), generating high-fidelity datasets that directly fuel imitation learning policies for autonomous mobile manipulation.
  • Figure 2: System Architecture of SuperSuit. Multimodal human intent is captured by a Dual Stream Control Engine, which decouples human motion into (1) a Mechanical Arm Stream for upper-body isomorphic mapping and (2) a Tracker Stream for torso and base control. Specifically, the tracker-based 6D pose is decomposed into articulated torso configurations and planar locomotion velocities. A velocity-level deadband is applied to suppress involuntary micro-sway. These robust signals simultaneously drive the mobile manipulator and feed into an LLM-assisted HIL pipeline, which merges Qwen3 kinematic reasoning with Paraformer transcriptions to automatically generate high-fidelity, language-annotated datasets for VLA models.
  • Figure 3: Remote Teleoperation Mode. SuperSuit enables intuitive, zero-latency bimanual manipulation across diverse spatial tasks: (a) Pick and Place, (b) Blocks Collection, and (c) Crate Stacking.
  • Figure 4: Active Demonstration Mode. A continuous sequence of the Pick and Place benchmark performed directly by the operator.
  • Figure 5: Kinematic Alignment of the SuperSuit. The exoskeleton's mechanical axes structurally mirror the operator's anatomical degrees of freedom. (Note: The grippers are 3D printed in white for visual clarity and teleoperation, whereas actual data collection employs black grippers identical to the target robot's configuration.)
  • ...and 1 more figure
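
The velocity-level deadband mentioned in the Figure 2 caption can be sketched as follows. This is an illustrative sketch only: the captions do not state the thresholds or whether the deadband is hard or continuous, so the function names `apply_deadband` and `base_velocity_command`, the threshold values, and the continuous (threshold-subtracting) form are all assumptions.

```python
import math
from typing import Tuple


def apply_deadband(value: float, threshold: float) -> float:
    """Return 0 for small magnitudes; otherwise shrink the magnitude by the threshold.

    Subtracting the threshold keeps the output continuous at the deadband edge,
    so the commanded velocity does not jump when the operator starts to step.
    """
    if abs(value) <= threshold:
        return 0.0
    return math.copysign(abs(value) - threshold, value)


def base_velocity_command(
    vx: float,
    vy: float,
    wz: float,
    lin_threshold: float = 0.05,  # m/s, hypothetical value
    ang_threshold: float = 0.10,  # rad/s, hypothetical value
) -> Tuple[float, float, float]:
    """Filter a planar (vx, vy, wz) velocity estimated from the tracker pose."""
    return (
        apply_deadband(vx, lin_threshold),
        apply_deadband(vy, lin_threshold),
        apply_deadband(wz, ang_threshold),
    )


# Small involuntary sway is suppressed entirely.
sway_cmd = base_velocity_command(0.02, -0.03, 0.05)

# A deliberate forward step passes through, reduced by the threshold.
step_cmd = base_velocity_command(0.30, 0.0, 0.0)
```

With these assumed thresholds, `sway_cmd` is all zeros while `step_cmd` keeps a forward component of roughly 0.25 m/s. A hard deadband (passing the raw value once it exceeds the threshold) would be an equally plausible reading of the caption; the continuous form is used here only because it avoids a step discontinuity in the base command.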