GROOT-2: Weakly Supervised Multi-Modal Instruction Following Agents

Shaofei Cai, Bowei Zhang, Zihao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang

TL;DR

GROOT-2 tackles open-world multimodal instruction following under weak supervision by combining constrained self-imitating on unlabeled demonstrations with human intention alignment on a small labeled set. It uses a VAE-inspired latent variable framework with a Transformer-XL policy: multimodal instructions are mapped into a shared latent space, and the policy conditions its actions on that latent, guided by two objectives that balance reconstruction and alignment. The approach is validated in four diverse environments (Atari, Minecraft, Language Table, Simpler Env), where using both unlabeled and labeled data improves instruction following. The analysis also shows how a property of the latent space, the ratio $R = \frac{BC}{BC + KL}$, governs the agent's behavior. The results suggest that weakly supervised multimodal instruction following can scale with data and benefit from language and video cues, and that the approach offers a practical path toward flexible, human-aligned agents.

Abstract

Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI. Although large-scale pre-training on unlabeled datasets (no language instruction) has enabled agents to learn diverse behaviors, these agents often struggle with following instructions. While augmenting the dataset with instruction labels can mitigate this issue, acquiring such high-quality annotations at scale is impractical. To address this issue, we frame the problem as a semi-supervised learning task and introduce GROOT-2, a multimodal instructable agent trained using a novel approach that combines weak supervision with latent variable models. Our method consists of two key components: constrained self-imitating, which utilizes large amounts of unlabeled demonstrations to enable the policy to learn diverse behaviors, and human intention alignment, which uses a smaller set of labeled demonstrations to ensure the latent space reflects human intentions. GROOT-2's effectiveness is validated across four diverse environments, ranging from video games to robotic manipulation, demonstrating its robust multimodal instruction-following capabilities.

Paper Structure

This paper contains 40 sections, 6 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: By feeding a mixture of demonstrations and some multimodal labels, we learn GROOT-2, a human-aligned agent capable of understanding multimodal instructions and adaptable to various environments, ranging from video games to robot manipulation, including Atari, Minecraft, Language Table, and Simpler Env.
  • Figure 2: The ELBO Objective of the VAE and Latent Space Spectrum. We define a spectrum based on $R = \frac{BC}{BC + KL}$, where $R = 0$ corresponds to “mechanical imitation” and $R = 1$ to “posterior collapse.” At low $R$, the latent vector $z$ directly outputs action sequences without considering observations ($BC \to 0$). As $R$ increases, $z$ represents high-level task information, such as specific object interactions. At $R = 1$, $z$ provides no beneficial information for decision-making.
  • Figure 3: Comparison of Policies with Different Latent Spaces. The reference video depicts digging for diamonds. A policy that mechanically imitates the trajectory falls into lava, while one aligned with human intention avoids lava and successfully reaches the diamonds.
  • Figure 4: Pipeline for Constructing a Training Batch for GROOT-2. Each batch includes two sample types: (1) demonstration-only samples for learning a latent-conditioned policy (Constrained Self-Imitating); and (2) labeled samples (text or expected returns) for aligning the latent space with human intentions (Human Intention Alignment). The sample ratio varies by dataset distribution.
  • Figure 5: Diverse visual environments used in the experiments. We test GROOT-2 on both video games (simple Atari games and the complex Minecraft game) and robotic manipulation environments (Language Table and Simpler Env). Minecraft is a partially observable, open-ended environment, while the others are fully observable.
  • ...and 7 more figures
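
The ratio from the Figure 2 caption can be made concrete with a small sketch. It assumes that BC and KL are the two non-negative loss terms of the ELBO (read here as the behavior-cloning term and the KL term). The function name `latent_ratio` and the sample values are hypothetical and do not come from the paper; the sketch only illustrates the two ends of the spectrum.

```python
def latent_ratio(bc_loss: float, kl_loss: float) -> float:
    """Return R = BC / (BC + KL) for two non-negative loss terms.

    Assumption: bc_loss and kl_loss are the scalar BC and KL terms
    of the ELBO; how they are computed is outside this sketch.
    """
    if bc_loss < 0 or kl_loss < 0:
        raise ValueError("both loss terms must be non-negative")
    total = bc_loss + kl_loss
    if total == 0:
        # R is undefined when both terms vanish.
        raise ValueError("R is undefined when BC + KL == 0")
    return bc_loss / total


# Illustrative (made-up) values for the two ends of the spectrum:
# BC -> 0 gives R -> 0, the "mechanical imitation" end.
r_mechanical = latent_ratio(bc_loss=0.0, kl_loss=2.0)
# KL -> 0 gives R -> 1, the "posterior collapse" end.
r_collapsed = latent_ratio(bc_loss=1.5, kl_loss=0.0)
# A mixed case lies strictly between the two ends.
r_between = latent_ratio(bc_loss=1.0, kl_loss=3.0)

print(r_mechanical, r_collapsed, r_between)
```

Tracking such a ratio during training would be one way to see where a learned latent space sits on the spectrum; whether the authors monitor it this way is not stated here.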