Language Guided Skill Discovery

Seungeun Rho, Laura Smith, Tianyu Li, Sergey Levine, Xue Bin Peng, Sehoon Ha

TL;DR

LGSD addresses the challenge of learning semantically diverse skills without task-specific rewards by using large language models to define a language distance $d_{\text{lang}}$ between state descriptions. It constrains the skill search space with language prompts and learns a latent mapping that is 1-Lipschitz with respect to $d_{\text{lang}}$, maximizing a Wasserstein-based objective so that state changes in the latent space track semantic differences. The approach enables zero-shot natural language goal following and supports downstream control by providing a high-quality skill repertoire. On locomotion and manipulation tasks, it outperforms mutual-information (MI)-based and distance-based baselines in diversity and sample efficiency. This work highlights a novel role for external semantic knowledge in skill discovery and opens avenues for integration with vision-language models and trajectory-level semantic analyses.
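
Read together with the Figure 2 caption, the objective summarized above can be written schematically as follows. This is a sketch inferred from the summary, not a formula quoted from the paper: the expectation over transitions, the notation $\pi(\cdot \mid z)$ for the skill-conditioned policy, and the choice to state the constraint over pairs of states $(s, s')$ are assumptions.

$$
\max_{\pi,\,\phi}\ \mathbb{E}_{z \sim \mathcal{N}(0, I),\ (s, s') \sim \pi(\cdot \mid z)}\left[\left(\phi(s') - \phi(s)\right)^{\top} z\right]
\quad \text{s.t.} \quad
\left\lVert \phi(s) - \phi(s') \right\rVert_2 \le d_{\text{lang}}(s, s')
$$

Under this reading, the constraint is what makes $\phi$ 1-Lipschitz with respect to $d_{\text{lang}}$: two states can be far apart in the latent space only if their language descriptions differ, so maximizing latent displacement pushes the agent toward semantically different states.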

Abstract

Skill discovery methods enable agents to learn diverse emergent behaviors without explicit rewards. To make learned skills useful for unknown downstream tasks, obtaining a semantically diverse repertoire of skills is essential. While some approaches introduce a discriminator to distinguish skills and others aim to increase state coverage, no existing work directly addresses the "semantic diversity" of skills. We hypothesize that leveraging the semantic knowledge of large language models (LLMs) can lead us to improve semantic diversity of resulting behaviors. In this sense, we introduce Language Guided Skill Discovery (LGSD), a skill discovery framework that aims to directly maximize the semantic diversity between skills. LGSD takes user prompts as input and outputs a set of semantically distinctive skills. The prompts serve as a means to constrain the search space into a semantically desired subspace, and the generated LLM outputs guide the agent to visit semantically diverse states within the subspace. We demonstrate that LGSD enables legged robots to visit different user-intended areas on a plane by simply changing the prompt. Furthermore, we show that language guidance aids in discovering more diverse skills compared to five existing skill discovery methods in robot-arm manipulation environments. Lastly, LGSD provides a simple way of utilizing learned skills via natural language.

Paper Structure (46 sections, 20 equations, 12 figures, 3 tables, 1 algorithm)

Figures (12)

  • Figure 1: We propose LGSD, which discovers a semantically distinctive set of skills. We showcase four sample skills acquired from a single training run. Our approach learned skills that manipulate only the 'edible' objects (banana and meat_can) out of four objects in total.
  • Figure 2: Overview of how LGSD works. Given a prompt, the LLM generates a description for each state. We then measure the difference between these descriptions and denote it as $d_{\text{lang}}$. Based on $d_{\text{lang}}$, we constrain the latent space by enforcing the 1-Lipschitz condition on $\phi$. The agent is then encouraged to visit states that make the vector $\phi(s') - \phi(s)$ align well with a vector $z$ sampled randomly from an isotropic Gaussian prior. This makes the agent explore the latent space in diverse directions depending on the sampled $z$.
  • Figure 3: By prompting the LLM to generate different descriptions depending on the state, LGSD can adapt its focus during training.
  • Figure 4: Trajectories of different skills trained with different prompts. For the Ant (top row), we recorded the base's $x,y$ coordinates. For the Franka robot-arm agent (bottom row), we recorded the $x, y$ coordinates of the object on the table.
  • Figure 5: Initial state of robot arm.
  • ...and 7 more figures
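
The mechanism described in the Figure 2 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `alignment_reward` and `lipschitz_violation` are hypothetical, the latent vectors stand in for the outputs of a learned $\phi$, and the `d_lang` values are made-up numbers in place of distances computed from LLM-generated descriptions.

```python
import numpy as np


def alignment_reward(phi_s, phi_next, z):
    """Inner product between the latent displacement phi(s') - phi(s) and z.

    Hypothetical helper: larger values mean the transition moves through the
    latent space in the direction of the sampled skill vector z.
    """
    return np.sum((phi_next - phi_s) * z, axis=-1)


def lipschitz_violation(phi_s, phi_next, d_lang):
    """Amount by which ||phi(s') - phi(s)|| exceeds d_lang(s, s').

    Hypothetical helper: zero wherever the 1-Lipschitz condition holds.
    A training loop could penalize positive values (an assumption here,
    not a detail taken from the source).
    """
    latent_dist = np.linalg.norm(phi_next - phi_s, axis=-1)
    return np.maximum(latent_dist - d_lang, 0.0)


# Two toy transitions in a 2-D latent space (placeholder values).
phi_s = np.array([[0.0, 0.0], [0.0, 0.0]])
phi_next = np.array([[1.0, 0.0], [0.0, 2.0]])

# A fixed skill vector for readability; in the caption, z is sampled from
# an isotropic Gaussian prior, e.g. np.random.default_rng(0).standard_normal(2).
z = np.array([1.0, 0.0])

# Placeholder language distances for the two transitions.
d_lang = np.array([1.0, 1.5])

rewards = alignment_reward(phi_s, phi_next, z)
violations = lipschitz_violation(phi_s, phi_next, d_lang)
print("rewards:", rewards)        # first transition aligns with z, second is orthogonal
print("violations:", violations)  # second transition moves farther than d_lang allows
```

In this toy case the first transition earns a reward of 1.0 and satisfies the constraint, while the second earns 0.0 and exceeds its language distance by 0.5, so only latent movement that is both aligned with $z$ and backed by a semantic change is encouraged.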