GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal-Conditioned Policy

Peiyan Li, Hongtao Wu, Yan Huang, Chilam Cheang, Liang Wang, Tao Kong

TL;DR

GR-MG tackles language-conditioned robotic manipulation by leveraging partially-annotated data through a two-module framework: a progress-guided diffusion-based goal image generator and a multi-modal goal-conditioned policy that uses both a text instruction $l$ and a generated goal image. The goal image is produced via diffusion conditioned on the current observation and progress (represented as a discretized cue), and the policy, a GPT-style transformer, predicts a $k$-step action trajectory while also regressing task progress, using the MAE-tokenized goal image as input. Training integrates data without action labels and data without text labels, enabling full use of datasets that are easier to collect, while inference uses text plus the generated goal image to guide actions. Across CALVIN and real-robot experiments, GR-MG achieves state-of-the-art generalization, higher success rates, longer task chains, and notable few-shot gains for novel skills, demonstrating scalable, robust language-grounded manipulation.
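
The inference loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `generate_goal`, `policy`, `observe`, and `act` are hypothetical callables standing in for the diffusion model, the transformer policy, and the robot interface. The number of progress bins, the regeneration interval `regen_every`, and the choice to execute only the first action of each predicted chunk are all assumptions made for the example.

```python
from typing import Any, Callable, List, Tuple


def discretize_progress(progress: float, num_bins: int = 10) -> int:
    """Clip progress to [0, 1] and map it to an integer bin in [0, num_bins].

    The bin count is an assumption; the source only says the cue is discretized.
    """
    clipped = min(max(progress, 0.0), 1.0)
    return int(clipped * num_bins)


def rollout(
    generate_goal: Callable[[Any, str, int], Any],
    policy: Callable[[str, Any, Any], Tuple[List[Any], float]],
    observe: Callable[[], Any],
    act: Callable[[Any], None],
    text: str,
    max_steps: int = 8,
    regen_every: int = 2,
) -> float:
    """Alternate between goal image generation and policy steps.

    All four callables are hypothetical placeholders for this sketch.
    """
    progress = 0.0  # no progress has been made before the first step
    goal_image = None
    for step in range(max_steps):
        obs = observe()
        # Refresh the goal image periodically, conditioned on the current
        # observation, the text, and the latest predicted progress.
        if step % regen_every == 0:
            goal_image = generate_goal(obs, text, discretize_progress(progress))
        # The policy returns a k-step action chunk plus a progress estimate,
        # which is fed back to the generator on the next refresh.
        action_chunk, progress = policy(text, goal_image, obs)
        act(action_chunk[0])  # assumption: execute only the first action
    return progress
```

The point of the loop is the feedback path: the progress value regressed by the policy becomes a conditioning input for the next goal image.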

Abstract

The robotics community has consistently aimed to achieve generalizable robot manipulation with flexible natural language instructions. One primary challenge is that obtaining robot trajectories fully annotated with both actions and texts is time-consuming and labor-intensive. However, partially-annotated data, such as human activity videos without action labels and robot trajectories without text labels, are much easier to collect. Can we leverage these data to enhance the generalization capabilities of robots? In this paper, we propose GR-MG, a novel method that supports conditioning on a text instruction and a goal image. During training, GR-MG samples goal images from trajectories and conditions on both the text and the goal image, or solely on the image when text is not available. During inference, where only the text is provided, GR-MG generates the goal image via a diffusion-based image-editing model and conditions on both the text and the generated image. This approach enables GR-MG to leverage large amounts of partially-annotated data while still using language to flexibly specify tasks. To generate accurate goal images, we propose a novel progress-guided goal image generation model that injects task progress information into the generation process. In simulation experiments, GR-MG improves the average number of tasks completed in a row of 5 from 3.35 to 4.04. In real-robot experiments, GR-MG is able to perform 58 different tasks and improves the success rate from 68.7% to 78.1% in the simple setting and from 44.4% to 60.6% in the generalization setting. It also outperforms the baseline methods in few-shot learning of novel skills. Video demos, code, and checkpoints are available on the project page: https://gr-mg.github.io/.
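
As a rough illustration of the training-time conditioning described in the abstract, the sketch below builds one policy training sample by picking a later frame of the same trajectory as the goal image and attaching the text only when it exists. The function name `make_policy_sample`, the `max_offset` window, and the uniform sampling of the goal offset are assumptions for this example; the abstract states only that goal images are sampled from trajectories.

```python
import random
from typing import Any, Dict, Optional, Sequence


def make_policy_sample(
    frames: Sequence[Any],
    text: Optional[str],
    t: int,
    max_offset: int,
    rng: random.Random,
) -> Dict[str, Any]:
    """Build one goal-conditioned sample from a trajectory (hypothetical helper).

    `text` is None for trajectories without language annotations, in which
    case the policy would condition on the goal image alone.
    """
    if not 0 <= t < len(frames) - 1:
        raise ValueError("t must leave at least one later frame to use as the goal")
    # Assumption: the goal is a frame 1..max_offset steps ahead, clamped to
    # the final frame of the trajectory.
    goal_index = min(len(frames) - 1, t + rng.randint(1, max_offset))
    return {
        "observation": frames[t],
        "goal_image": frames[goal_index],
        "goal_index": goal_index,
        "text": text,
        "has_text": text is not None,
    }
```

Because the goal image comes from the trajectory itself, a robot trajectory with no text label still yields a complete training sample.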

Paper Structure

This paper contains 23 sections, 2 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Overview. GR-MG consists of two modules: a progress-guided goal image generation model and a multi-modal goal-conditioned policy. The former generates a goal image based on the current observation, text instruction, and task progress. The latter predicts task progress and actions based on the text and the goal image produced by the former. Data without action labels can be used to train the goal image generation model, while the multi-modal goal-conditioned policy can leverage data without text labels. Fully-annotated data is used for training both modules.
  • Figure 2: Network Architecture. We use a diffusion model to generate the goal image based on the current image, text instruction, and task progress. The generated goal image is then sent to the multi-modal goal-conditioned policy, which also takes as inputs the text instruction and sequences of observation images and robot states. The policy is a GPT-style transformer. We insert query tokens [PROG], [OBS], and [ACT] after the input tokens to predict the task progress, future images, and actions, respectively. The progress is fed back to the goal image generation model. In this figure, we show the setup with two camera views: a static view and a view captured from a wrist-mounted camera.
  • Figure 3: Experiments. The CALVIN benchmark consists of 34 tasks across four different environments. The real-robot experiments encompass 58 tasks, including pick-and-place and non-pick-and-place manipulations.
  • Figure 4: Visualization of the generated goal images in CALVIN Benchmark and Real-Robot Experiments. (a) The generated images of GR-MG closely align with the ground truths. Without task progress information, the goal images generated by GR-MG w/o progress diverge from the text instructions. (b) Without partially-annotated data, the generated goal images do not adhere to the text instructions and suffer from hallucination.
  • Figure 5: Success Rates of Real-Robot Experiments. Unseen Average shows the average success rate of the four unseen generalization settings.
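
The data routing described in the Figure 1 caption can be written down as a small lookup. This is a sketch under assumptions: the module names are labels chosen for this example, and treating a sample with neither text nor action labels as unusable is an inference from the caption, not something it states.

```python
from typing import Set

GOAL_IMAGE_GENERATOR = "goal_image_generator"  # label chosen for this sketch
POLICY = "policy"  # label chosen for this sketch


def modules_trained_by(has_text: bool, has_actions: bool) -> Set[str]:
    """Return which GR-MG modules a data sample can help train.

    Per the Figure 1 caption: data without action labels can train the goal
    image generation model, data without text labels can train the policy,
    and fully-annotated data trains both.
    """
    modules: Set[str] = set()
    if has_text:
        # The generator is conditioned on text, so it needs text but no actions.
        modules.add(GOAL_IMAGE_GENERATOR)
    if has_actions:
        # The policy predicts actions; it can fall back to the goal image alone.
        modules.add(POLICY)
    return modules
```

For instance, a human activity video with a caption but no actions maps to the goal image generator only, while an unlabeled robot trajectory maps to the policy only.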