Learning with Language-Guided State Abstractions

Andi Peng, Ilia Sucholutsky, Belinda Z. Li, Theodore R. Sumers, Thomas L. Griffiths, Jacob Andreas, Julie A. Shah

TL;DR

The paper introduces Language-Guided Abstraction (LGA), a framework that uses natural language descriptions and language models to automatically construct task-relevant state abstractions for imitation learning. By transforming raw perceptual inputs into a text-based feature set, selecting relevant features with an LM, and instantiating a compact abstract state, LGA enables an abstraction-conditioned policy to learn over simplified representations. Empirical results in the VIMA environment show that LGA abstractions are on par with human-designed ones in effectiveness, while substantially reducing human effort, and that policies trained with LGA abstractions generalize robustly to covariate shifts and linguistic ambiguities, including zero-shot generalization to unseen commands. Real-world Spot robot experiments further demonstrate LGA's practical impact for robust, sample-efficient mobile manipulation with distractors and ambiguous goals.

Abstract

We describe a framework for using natural language to design state abstractions for imitation learning. Generalizable policy learning in high-dimensional observation spaces is facilitated by well-designed state representations, which can surface important features of an environment and hide irrelevant ones. These state representations are typically manually specified, or derived from other labor-intensive labeling procedures. Our method, LGA (language-guided abstraction), uses a combination of natural language supervision and background knowledge from language models (LMs) to automatically build state representations tailored to unseen tasks. In LGA, a user first provides a (possibly incomplete) description of a target task in natural language; next, a pre-trained LM translates this task description into a state abstraction function that masks out irrelevant features; finally, an imitation policy is trained using a small number of demonstrations and LGA-generated abstract states. Experiments on simulated robotic tasks show that LGA yields state abstractions similar to those designed by humans, but in a fraction of the time, and that these abstractions improve generalization and robustness in the presence of spurious correlations and ambiguous specifications. We illustrate the utility of the learned abstractions on mobile manipulation tasks with a Spot robot.
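
The three steps named in the abstract (textualize the observation, let an LM pick task-relevant features, mask out everything else) can be illustrated with a minimal sketch. This is not the authors' implementation: `SceneObject`, `textualize`, `keyword_selector`, and `abstract_state` are hypothetical names introduced here, and the keyword matcher is only a stand-in for the pre-trained LM query, which in the paper draws on background knowledge that plain keyword matching lacks.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class SceneObject:
    """One perceived entity; both fields are illustrative assumptions."""
    name: str                  # text label, e.g. from an object detector
    position: Tuple[int, int]  # coarse (x, y) location in the observation


def textualize(scene: List[SceneObject]) -> List[str]:
    """Step 1: turn the raw observation into a text-based feature set."""
    return sorted({obj.name for obj in scene})


def keyword_selector(task: str, features: List[str]) -> Set[str]:
    """Step 2 (stand-in): choose features relevant to the task description.

    A real system would prompt a language model here; this placeholder
    keeps only features whose name appears verbatim in the task text.
    """
    words = set(task.lower().split())
    return {f for f in features if f.lower() in words}


def abstract_state(
    scene: List[SceneObject],
    task: str,
    select: Callable[[str, List[str]], Set[str]] = keyword_selector,
) -> List[SceneObject]:
    """Step 3: mask out every object the selector judged irrelevant."""
    relevant = select(task, textualize(scene))
    return [obj for obj in scene if obj.name in relevant]


if __name__ == "__main__":
    scene = [
        SceneObject("orange", (0, 1)),
        SceneObject("banana", (2, 2)),  # distractor for this task
        SceneObject("bowl", (3, 0)),    # distractor for this task
    ]
    print(abstract_state(scene, "bring me the orange"))
```

An imitation policy would then be trained on the output of `abstract_state` rather than on the full scene. Passing the selector in as an argument keeps the sketch agnostic about which LM, if any, performs the selection.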

Paper Structure (24 sections, 3 equations, 4 figures)

Figures (4)

  • Figure 1: A: Example demonstration in our environment, showing Spot picking up an orange and bringing it to the user. B: Our approach, Language-Guided Abstraction (LGA), creates a state abstraction with task-relevant features identified by an LM. The policy is learned directly over this abstracted state.
  • Figure 2: We evaluate on three task settings in VIMA. A: Pick-and-place. B: Rotate. C: Sweep while avoid. Red bounding boxes depict task-relevant features which must be accounted for in the abstraction.
  • Figure 3: (Q1) A: Comparing task performance (averaged over all tasks) of each method when controlling the number of training demonstrations. B: Comparing the amount of time (averaged over all tasks) that human users spent specifying task-relevant features for each method. LGA outperforms baselines on task performance while significantly reducing user time spent compared to manual feature specification ($p < 0.001$).
  • Figure 4: (Q2) A: Results on state covariate shifts. LGA variants that condition on state abstractions (with or without the original observation) outperform LGA-L (which conditions on the language abstraction only) and GCBC+DART (which attempts to use noise injection to handle covariate shift). (Q3) B: Results on multi-task ambiguity. We observe the same trends, with LGA variants that condition on state abstractions able to resolve previously unseen linguistic commands.