COMPASS: Cross-embodiment Mobility Policy via Residual RL and Skill Synthesis

Wei Liu, Huihua Zhao, Chenran Li, Yuchen Deng, Joydeep Biswas, Soha Pouya, Yan Chang

TL;DR

The paper tackles scalable mobility across diverse robot embodiments by proposing COMPASS, a three-stage pipeline that first learns mobility priors via imitation learning, then specializes these priors for each embodiment with residual RL, and finally distills the specialists into a single generalist policy conditioned on an embodiment embedding. By leveraging a world-model-based latent representation, COMPASS enables data-efficient adaptation and strong cross-embodiment generalization, including zero-shot sim-to-real transfer. Empirical results demonstrate large improvements in success rate (SR) and travel efficiency (WTT) for embodiment specialists and a robust generalist policy that performs comparably to specialists across multiple platforms. The approach also enables applications beyond mobility, such as open vocabulary navigation and synthetic data generation for downstream Vision-Language-Action models, highlighting practical impact for scalable, adaptable robotic systems.

Abstract

As robots are increasingly deployed in diverse application domains, enabling robust mobility across different embodiments has become a critical challenge. Classical mobility stacks, though effective on specific platforms, require extensive per-robot tuning and do not scale easily to new embodiments. Learning-based approaches, such as imitation learning (IL), offer alternatives but are limited by the need for high-quality demonstrations for each embodiment. To address these challenges, we introduce COMPASS, a unified framework that enables scalable cross-embodiment mobility using expert demonstrations from only a single embodiment. We first pre-train a mobility policy on a single robot using IL, combining a world model with a policy model. We then apply residual reinforcement learning (RL) to efficiently adapt this policy to diverse embodiments through corrective refinements. Finally, we distill specialist policies into a single generalist policy conditioned on an embodiment embedding vector. This design significantly reduces the burden of collecting data while enabling robust generalization across a wide range of robot designs. Our experiments demonstrate that COMPASS scales effectively across diverse robot platforms while maintaining adaptability to various environment configurations, achieving a generalist policy with a success rate approximately 5X higher than the pre-trained IL policy on unseen embodiments, and further demonstrate zero-shot sim-to-real transfer.
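The "corrective refinements" of the residual RL stage can be pictured as a small correction added to the base policy's velocity command. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the function name `compose_action`, the symmetric clipping limit, and the example numbers are all hypothetical, and the abstract does not say whether the final command is clipped.

```python
from typing import List, Sequence


def compose_action(
    base_action: Sequence[float],
    residual: Sequence[float],
    limit: float,
) -> List[float]:
    """Add a residual correction to a base velocity command.

    base_action: command from the frozen, imitation-learned base policy.
    residual:    correction from the embodiment-specific residual policy.
    limit:       hypothetical symmetric bound on each command component
                 (an assumption; not stated in the source).
    """
    if len(base_action) != len(residual):
        raise ValueError("base_action and residual must have the same length")
    # Element-wise sum, then clip each component to [-limit, limit].
    return [max(-limit, min(limit, b + r)) for b, r in zip(base_action, residual)]


# Example: a (linear, angular) velocity command with a small correction.
final_cmd = compose_action([0.5, 0.0], [0.25, -0.25], limit=1.0)
```

In this picture only the residual term is trained with RL, so the base policy's prior behaviour is kept and each embodiment learns only a correction to it.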

Paper Structure

This paper contains 42 sections, 8 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: High-level overview of the COMPASS workflow: (1) Imitation learning produces a base policy and world model using readily available teacher policies on a mobile robot. (2) Residual RL fine-tunes the base policy for multiple embodiments, optimizing for physical constraints and sensor modalities. (3) Policy distillation consolidates these embodiment-specialist policies into one robust cross-embodiment policy.
  • Figure 2: Residual RL architecture: (a) residual RL loop and (b) world model architecture. The world model processes the same inputs as the IL approach to produce the policy state, while the imitation-learned base policy generates a base action. The residual policy refines this action with a correction term, producing the final velocity command for embodiment-specific joint controllers. With the joint actions, the robot interacts with the environment and receives new observations and rewards. The data recorder records the pairs of policy state and action for policy distillation.
  • Figure 3: Policy distillation aggregates multiple expert specialists (one per embodiment). The final multi-embodiment policy uses a one-hot or learned embedding to condition decisions on robot morphology.
  • Figure 4: Residual RL training environment. Multiple instances are tiled and run in parallel to accelerate data collection and model updates.
  • Figure 5: Evaluation environments with multiple layouts to assess policy generalization: (a) warehouse with single rack ($24\,\mathrm{m} \times 38\,\mathrm{m}$), (b) warehouse with multi racks ($24\,\mathrm{m} \times 38\,\mathrm{m}$), (c) office ($10\,\mathrm{m} \times 10\,\mathrm{m}$), and (d) combined scenes with multi racks ($30\,\mathrm{m} \times 38\,\mathrm{m}$).
  • ...and 4 more figures
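
The one-hot conditioning mentioned in the Figure 3 caption, together with the recorded (policy state, action) pairs from Figure 2, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, and the mean-squared-error loss is a simple stand-in, since the distillation objective actually used is not given in this section.

```python
from typing import List, Sequence


def one_hot(index: int, num_embodiments: int) -> List[float]:
    """One-hot vector identifying a robot embodiment."""
    if not 0 <= index < num_embodiments:
        raise ValueError("embodiment index out of range")
    return [1.0 if i == index else 0.0 for i in range(num_embodiments)]


def conditioned_input(
    policy_state: Sequence[float], index: int, num_embodiments: int
) -> List[float]:
    """Concatenate the policy state with the embodiment one-hot vector."""
    return list(policy_state) + one_hot(index, num_embodiments)


def distillation_loss(
    student_actions: Sequence[Sequence[float]],
    teacher_actions: Sequence[Sequence[float]],
) -> float:
    """Mean squared error between generalist (student) and specialist
    (teacher) actions. MSE is an assumed stand-in objective."""
    total, count = 0.0, 0
    for s_row, t_row in zip(student_actions, teacher_actions):
        for s, t in zip(s_row, t_row):
            total += (s - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no actions to compare")
    return total / count


# Example: a 2-D policy state for embodiment 2 out of 4 hypothetical robots.
x = conditioned_input([0.5, -1.0], index=2, num_embodiments=4)
loss = distillation_loss([[1.0, 0.0]], [[0.0, 0.0]])
```

A learned embedding, the other option named in the caption, would replace `one_hot` with a trainable lookup table indexed by embodiment.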