GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal-Conditioned Policy
Peiyan Li, Hongtao Wu, Yan Huang, Chilam Cheang, Liang Wang, Tao Kong
TL;DR
GR-MG addresses language-conditioned robot manipulation by exploiting partially annotated data with two modules: a progress-guided, diffusion-based goal image generator and a multi-modal goal-conditioned policy that takes both a text instruction $l$ and a goal image. The generator produces the goal image conditioned on the current observation and on task progress, which is represented as a discretized cue. The policy, a GPT-style transformer, takes the MAE-tokenized goal image as input and predicts a $k$-step action trajectory while also regressing task progress. Training combines data without action labels and data without text labels, both of which are easier to collect than fully annotated trajectories; at inference, the policy is guided by the text and the generated goal image. On CALVIN, GR-MG raises the average number of tasks completed in a row (out of 5) from 3.35 to 4.04. On a real robot, it raises the success rate from 68.7% to 78.1% in simple settings and from 44.4% to 60.6% in generalization settings, and it outperforms the baselines in few-shot learning of novel skills.
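The data handling summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Trajectory` container, the fixed `goal_offset`, the number of progress bins, and the uniform choice of timestep are all assumptions made for illustration. It only shows how one sampler could serve trajectories that lack text labels, action labels, or neither.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trajectory:
    """Hypothetical container for one (possibly partially annotated) trajectory."""
    frames: List[str]                            # stand-ins for images, e.g. file names
    text: Optional[str] = None                   # None for robot data without text labels
    actions: Optional[List[List[float]]] = None  # None for videos without action labels


def discretize_progress(t: int, length: int, num_bins: int = 10) -> int:
    """Map timestep t of a trajectory with `length` frames to a bin in [0, num_bins - 1]."""
    frac = t / max(length - 1, 1)
    return min(int(frac * num_bins), num_bins - 1)


def sample_training_example(traj: Trajectory, k: int = 5, goal_offset: int = 10, rng=random) -> dict:
    """Sample an observation, a future frame as the goal image, and whatever labels exist.

    Assumption: the goal image is the frame `goal_offset` steps ahead, clipped to
    the last frame. The paper may choose goal frames differently.
    """
    length = len(traj.frames)
    if length < 2:
        raise ValueError("need at least two frames to sample an observation and a goal")
    t = rng.randrange(length - 1)
    goal_t = min(t + goal_offset, length - 1)
    return {
        "observation": traj.frames[t],
        "goal_image": traj.frames[goal_t],
        "text": traj.text,  # may be None: condition on the goal image only
        "progress_bin": discretize_progress(t, length),
        # May be None (no action labels) or shorter than k near the end of the trajectory.
        "action_chunk": None if traj.actions is None else traj.actions[t:t + k],
    }
```

In this sketch, an example whose `text` is `None` would be used with image-only conditioning, and an example whose `action_chunk` is `None` would contribute no action loss.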
Abstract
The robotics community has consistently aimed to achieve generalizable robot manipulation with flexible natural language instructions. One primary challenge is that obtaining robot trajectories fully annotated with both actions and texts is time-consuming and labor-intensive. However, partially annotated data, such as human activity videos without action labels and robot trajectories without text labels, are much easier to collect. Can we leverage these data to enhance the generalization capabilities of robots? In this paper, we propose GR-MG, a novel method which supports conditioning on a text instruction and a goal image. During training, GR-MG samples goal images from trajectories and conditions on both the text and the goal image, or solely on the image when text is not available. During inference, when only the text is provided, GR-MG generates the goal image via a diffusion-based image-editing model and conditions on both the text and the generated image. This approach enables GR-MG to leverage large amounts of partially annotated data while still using language to flexibly specify tasks. To generate accurate goal images, we propose a novel progress-guided goal image generation model which injects task progress information into the generation process. In simulation experiments, GR-MG improves the average number of tasks completed in a row (out of 5) from 3.35 to 4.04. In real-robot experiments, GR-MG is able to perform 58 different tasks and improves the success rate from 68.7% to 78.1% in simple settings and from 44.4% to 60.6% in generalization settings. It also outperforms the baseline methods in few-shot learning of novel skills. Video demos, code, and checkpoints are available on the project page: https://gr-mg.github.io/.
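To make the simulation metric concrete, the sketch below computes an average number of tasks completed in a row over chains of 5 tasks. It assumes that a rollout stops counting at the first failed task, which is how consecutive completion is usually scored. The function names and the toy rollouts are illustrative and are not taken from the paper or its results.

```python
from typing import Sequence


def tasks_completed_in_a_row(outcomes: Sequence[bool]) -> int:
    """Count successes from the start of the chain up to the first failure."""
    count = 0
    for succeeded in outcomes:
        if not succeeded:
            break
        count += 1
    return count


def average_chain_length(rollouts: Sequence[Sequence[bool]]) -> float:
    """Average, over rollouts, of the number of tasks completed in a row."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    return sum(tasks_completed_in_a_row(r) for r in rollouts) / len(rollouts)


# Toy data (made up for illustration): each rollout is a chain of 5 tasks.
toy_rollouts = [
    [True, True, True, True, True],     # 5 in a row
    [True, True, False, True, True],    # 2 in a row; later successes do not count
    [False, True, True, True, True],    # 0 in a row
    [True, True, True, True, False],    # 4 in a row
]
print(average_chain_length(toy_rollouts))  # 2.75
```

Under this scoring, the maximum value is 5, so the reported improvement from 3.35 to 4.04 means that, on average, a rollout gets roughly 0.7 tasks further along the chain before its first failure.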
