From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models

Theo Cachet; Christopher R. Dance; Olivier Sigaud

TL;DR

This paper introduces a novel decomposition of the problem of building a language-conditioned agent (LCA): first find an environment configuration to which a vision-language model (VLM) assigns a high score for text describing a task; then use a (pretrained) goal-conditioned policy to reach that configuration.

Abstract

Vision-language models (VLMs) have tremendous potential for grounding language, and thus enabling language-conditioned agents (LCAs) to perform diverse tasks specified with text. This has motivated the study of LCAs based on reinforcement learning (RL) with rewards given by rendering images of an environment and evaluating those images with VLMs. If single-task RL is employed, such approaches are limited by the cost and time required to train a policy for each new task. Multi-task RL (MTRL) is a natural alternative, but requires a carefully designed corpus of training tasks and does not always generalize reliably to new tasks. Therefore, this paper introduces a novel decomposition of the problem of building an LCA: first find an environment configuration that has a high VLM score for text describing a task; then use a (pretrained) goal-conditioned policy to reach that configuration. We also explore several enhancements to the speed and quality of VLM-based LCAs, notably, the use of distilled models, and the evaluation of configurations from multiple viewpoints to resolve the ambiguities inherent in a single 2D view. We demonstrate our approach on the Humanoid environment, showing that it results in LCAs that outperform MTRL baselines in zero-shot generalization, without requiring any textual task descriptions or other forms of environment-specific annotation during training. Videos and an interactive demo can be found at https://europe.naverlabs.com/text2control

Paper Structure (78 sections, 19 equations, 37 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Annotation.
Foundation models (FMs).
Text-to-goal methods.
Methods
Definitions
Environment.
Configurations.
Rendering functions.
VLMs.
Configuration-text score.
Distilled model.
Text-to-Goal Generation Methods
Retrieving Configurations
...and 63 more sections

Figures (37)

Figure 1: Overview of the proposed approach. We use rendering functions and a VLM image encoder to precompute embeddings of all configurations in a dataset. Given text $x$ describing a new task, we embed it with the VLM text encoder, evaluate its cosine similarity with the precomputed image embeddings, select $k$ highest-scoring configurations, and (optionally) finetune them using a distilled model. Finally, the best configuration is fed to a pretrained goal-conditioned agent to execute the task. Note that our approach can retrieve and finetune configurations for a text $x$ without ever having seen that text before, and that the GCRL agent can reach a goal configuration without having been specifically trained for that goal; thus, our approach results in a zero-shot LCA.
Figure 2: Each column represents a task: (1) "downward-facing dog yoga pose", (2) "headstand", (3) "holding a blue box", (4) "box step-up", (5) "warrior yoga pose" and (6) "boxing guard". Within each column, the top three rows show the front, left and right views of the best configuration from the embedding-diversity dataset, retrieved using the front view only. The bottom three rows show the best configuration retrieved using the three views.
Figure 3: Each row shows a single task. Columns 1 and 3 show the top-1 and another of the top-20 retrieved configurations; columns 2 and 4 show their finetuned counterparts (without VLM selection).
Figure 4: Approximate VLM returns and approximate best-in-trajectory configuration-text scores for the STRL and MTRL baselines, and for the GCRL agent given four types of goal configuration: retrieval from the random policy dataset (GCRL-R); and retrieval from the diversity dataset (GCRL-D), plus finetuning (GCRL-F), plus selection based on the exact score (GCRL-S). MTRL (train) and MTRL (test) are for the baseline models evaluated on their training and test tasks respectively. The metrics are averaged over all 256 tasks, with 10 episodes per task.
Figure 5: Front view of the best-in-trajectory configuration (with respect to the approximate VLM score) reached by GCRL-D and by the MTRL (test) baseline in a single rollout. Analogous figures for all LCAs on all 256 tasks are in Appendix \ref{sec:reached-configs}.
...and 32 more figures
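
The retrieval step described in the Figure 1 caption (embed the text, compare it with precomputed image embeddings by cosine similarity, keep the $k$ highest-scoring configurations) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `normalize` and `retrieve_top_k` are hypothetical, random arrays stand in for the VLM encoders, and averaging the similarity over views is an assumption made here for illustration; the captions above do not state how multiple views are combined.

```python
import numpy as np


def normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_top_k(text_emb, image_embs, k):
    """Return indices and scores of the k best-matching configurations.

    text_emb:   (d,) text embedding.
    image_embs: (n_configs, n_views, d) precomputed image embeddings.
    The score of a configuration is its cosine similarity with the text,
    averaged over views (an assumption of this sketch).
    """
    t = normalize(text_emb)
    v = normalize(image_embs)
    scores = (v @ t).mean(axis=1)      # shape (n_configs,)
    k = min(k, scores.shape[0])
    top = np.argsort(-scores)[:k]      # indices sorted by descending score
    return top, scores[top]


# Toy data: random stand-ins for VLM embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(100, 3, 16))   # 100 configurations, 3 views
text_emb = rng.normal(size=16)
image_embs[42] = text_emb                    # make configuration 42 a perfect match

idx, sc = retrieve_top_k(text_emb, image_embs, k=5)
print(idx[0], sc)
```

In the pipeline of Figure 1, the retrieved configurations would then optionally be finetuned with the distilled model, and the best one passed as a goal to the goal-conditioned agent; those steps are outside the scope of this sketch.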
