Modality-Augmented Fine-Tuning of Foundation Robot Policies for Cross-Embodiment Manipulation on GR1 and G1

Junsung Park, Hogun Kee, Songhwai Oh

TL;DR

The paper tackles cross-embodiment manipulation by fine-tuning foundation robot policies with additional sensory modalities. It introduces a modality-augmented pipeline for GR1 (depth and contact cues) and builds a high-quality multi-modal G1 dataset using cuRobo planning, IK, and ground-truth contact forces to study transfer from GR1 to G1. Empirically, depth and contact cues improve GR1 performance, and contact-force guidance proves crucial for reliable G1 transfer, reaching up to 94% success on the "Pick Apple to Bowl" task. The work provides a data-centric pathway for extending foundation robot policies to new embodiments by aligning sensory modalities with robot morphologies and interaction dynamics.

Abstract

This paper presents a modality-augmented fine-tuning framework designed to adapt foundation robot policies to diverse humanoid embodiments. We validate our approach across two distinct settings: (i) the GR1 embodiment, utilizing public datasets where we introduce post-processed modalities, including binary contact signals and ZoeDepth-generated metric depth; and (ii) the Unitree G1 embodiment, for which we contribute a novel multi-modal dataset incorporating cuRobo motion planning, inverse kinematics, and ground-truth contact-force measurements. Our experiments demonstrate that modality augmentation consistently enhances policy performance across different embodiments. Specifically, for the GR1, integrating contact-state cues and RGB-D fusion improves online success rates from 51% to 63%. Furthermore, in the G1 "Pick Apple to Bowl" task, our contact-augmented model achieves a success rate of 94%, significantly outperforming the 48% achieved by standard fine-tuning and the 0% baseline of zero-shot transfer. These results highlight that lightweight post-processing effectively strengthens policies for GR1, while high-quality multi-modal data is crucial for reliable transfer to the Unitree G1. Consequently, this work establishes a unified, data-centric pathway for extending foundation robot policies through targeted modality design and multi-modal fine-tuning.
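The abstract mentions binary contact signals alongside continuous contact-force measurements but does not say how one is derived from the other. As a rough illustration only, the sketch below thresholds per-sensor force magnitudes into a per-timestep binary contact flag. The function name `binarize_contact`, the threshold of 0.5, and the toy force values are all assumptions made for this example, not details from the paper.

```python
from typing import List, Sequence


def binarize_contact(
    forces: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Map per-sensor force readings to one binary contact flag per timestep.

    forces[t][k] is the force reading of sensor k at timestep t.
    The threshold is a hypothetical value chosen for illustration.
    """
    # A timestep counts as "in contact" if any sensor exceeds the threshold.
    return [int(any(abs(f) > threshold for f in step)) for step in forces]


# Toy trajectory: 4 timesteps, 3 sensors (values are made up).
forces = [
    [0.0, 0.1, 0.0],  # free motion
    [0.2, 0.9, 0.0],  # first touch on sensor 1
    [1.4, 2.0, 0.3],  # firm grasp
    [0.1, 0.0, 0.0],  # released
]
contact = binarize_contact(forces, threshold=0.5)
```

A single fixed threshold is the simplest choice; noisy sensors would likely need smoothing or hysteresis, which this sketch omits.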

Paper Structure

This paper contains 23 sections, 12 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Overview of the GR00T dual-system architecture. System 2 (VLM) processes image observations and language instructions into semantic tokens, while System 1 (Diffusion Transformer) generates motor actions via iterative denoising conditioned on multimodal tokens.
  • Figure 2: Visualization of the G1 data collection environment. Left: front view of the scene setup. Right: ego-centric view from the robot’s onboard camera during demonstration execution.
  • Figure 3: State vs. Action per G1 joint. For each joint, we visualize the proprioceptive state (blue) and the executed action command (red) over time. The trajectories generated through cuRobo-based planning exhibit smooth, consistent evolution across the 20 DoF upper-limb chain, ensuring high-quality supervision for diffusion policy fine-tuning.
  • Figure 4: Contact forces per finger over time. The G1 dexterous hand provides high-fidelity fingertip and palm force measurements. During grasp execution, contact is concentrated on specific fingers (left thumb, middle, and palm), while others remain inactive. These rich interaction signals are used to train contact-aware diffusion policies.
  • Figure 5: Dedicated Contact Encoder Module. Instead of concatenating contact into the proprioceptive state, the binary (or continuous) contact signal $c_t$ is processed by a learnable Contact Encoder. The resulting embedding is input to the DiT blocks as a separate modality, alongside vision, language, and state tokens. This design treats contact as an independent modality and allows the policy to learn richer interaction-aware representations.
  • ...and 4 more figures
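The Figure 5 caption describes routing the contact signal $c_t$ through its own encoder and passing the resulting embedding to the DiT blocks as a separate token, rather than concatenating it into the proprioceptive state. The NumPy sketch below shows only that data flow. The two-layer MLP, all dimensions, the class name `ContactEncoder`, and the helper `build_conditioning_tokens` are assumptions made for illustration; the weights here are fixed random values, whereas a real implementation would train them in a deep learning framework.

```python
import numpy as np


class ContactEncoder:
    """Hypothetical two-layer MLP mapping a contact vector c_t to one token."""

    def __init__(self, contact_dim: int, embed_dim: int,
                 hidden_dim: int = 32, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        # Randomly initialised stand-ins for learnable parameters.
        self.w1 = rng.normal(0.0, 0.02, size=(contact_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(hidden_dim, embed_dim))
        self.b2 = np.zeros(embed_dim)

    def __call__(self, c_t: np.ndarray) -> np.ndarray:
        hidden = np.maximum(c_t @ self.w1 + self.b1, 0.0)  # ReLU
        return hidden @ self.w2 + self.b2


def build_conditioning_tokens(vision: np.ndarray, language: np.ndarray,
                              state: np.ndarray,
                              contact: np.ndarray) -> np.ndarray:
    """Stack modality tokens; contact enters as its own token row."""
    return np.concatenate([vision, language, state, contact[None, :]], axis=0)


embed_dim = 64
vision_tokens = np.zeros((16, embed_dim))   # placeholder vision tokens
language_tokens = np.zeros((8, embed_dim))  # placeholder language tokens
state_token = np.zeros((1, embed_dim))      # placeholder state token

c_t = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])  # toy binary contact vector
encoder = ContactEncoder(contact_dim=c_t.shape[0], embed_dim=embed_dim)
contact_token = encoder(c_t)

tokens = build_conditioning_tokens(
    vision_tokens, language_tokens, state_token, contact_token
)
```

Keeping contact as a separate token leaves the state vector's dimensionality unchanged, so the existing state pathway would not need to be resized to accommodate the new modality.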