Text-to-Decision Agent: Offline Meta-Reinforcement Learning from Natural Language Supervision

Shilin Zhang, Zican Hu, Wenhao Wu, Xinyi Xie, Jianxiang Tang, Chunlin Chen, Daoyi Dong, Yu Cheng, Zhenhong Sun, Zhi Wang

TL;DR

T2DA introduces a scalable framework for offline meta-RL that grounds natural language supervision in environment dynamics. It builds a dynamics-aware world model to encode multi-task data, then uses CLIP-style contrastive pre-training to align language descriptions with decision embeddings, enabling zero-shot text-to-decision generation. The approach supports two scalable policies—Text-to-Decision Diffuser and Text-to-Decision Transformer—achieving state-of-the-art zero-shot generalization on MuJoCo and Meta-World benchmarks with robustness to data quality and language-encoder choices. This work opens pathways for scalable, language-driven generalist offline RL agents and suggests directions for scaling to real-world embodied systems.

Abstract

Offline meta-RL usually tackles generalization by inferring task beliefs from high-quality samples or warmup explorations. This restricted form of supervision limits generality and usability, since such signals are expensive or even infeasible to acquire in advance for unseen tasks. Learning directly from raw text about decision tasks is a promising alternative that leverages a much broader source of supervision. In this paper, we propose Text-to-Decision Agent (T2DA), a simple and scalable framework that supervises offline meta-RL with natural language. We first introduce a generalized world model to encode multi-task decision data into a dynamics-aware embedding space. Then, inspired by CLIP, we predict which textual description goes with which decision embedding, effectively bridging their semantic gap via contrastive language-decision pre-training and aligning the text embeddings to comprehend the environment dynamics. After training the text-conditioned generalist policy, the agent can directly realize zero-shot text-to-decision generation in response to language instructions. Comprehensive experiments on MuJoCo and Meta-World benchmarks show that T2DA facilitates high-capacity zero-shot generalization and outperforms various types of baselines. Our code is available at https://github.com/NJU-RL/T2DA.
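
To make the "predict which textual description goes with which decision embedding" step concrete, the following is a minimal NumPy sketch of a CLIP-style symmetric contrastive loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.07, and the use of cosine similarity are assumptions borrowed from the usual CLIP recipe, and the paper's exact objective may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(text_emb, decision_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of text_emb is assumed to describe the same task as row i of
    decision_emb; every other row in the batch serves as a negative.
    The temperature default is an assumed value, not taken from the paper.
    """
    t = l2_normalize(text_emb)
    d = l2_normalize(decision_emb)
    logits = t @ d.T / temperature              # (batch, batch) similarity matrix
    idx = np.arange(len(t))
    loss_t2d = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> decision
    loss_d2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # decision -> text
    return 0.5 * (loss_t2d + loss_d2t)

# Toy check with random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
decision = rng.normal(size=(8, 16))
aligned = contrastive_loss(decision, decision)
shuffled = contrastive_loss(decision, np.roll(decision, 1, axis=0))
print(aligned < shuffled)
```

In this toy check the text embeddings are simply copies of the decision embeddings, so matched pairs sit on the diagonal of the similarity matrix and the loss is lower than when the pairing is shifted by one row. In the setting the abstract describes, minimizing such a loss with respect to the text encoder would pull each description toward the embedding of its own task.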

Paper Structure

This paper contains 29 sections, 9 equations, 8 figures, 13 tables, 6 algorithms.

Figures (8)

  • Figure 1: t-SNE visualization of Ant-Dir where tasks with target directions in $[0,2\pi]$ are mapped to rainbow-colored points. Top: we encode multi-task data into dynamics-aware decision embeddings to capture task-specific environment dynamics. Bottom: we bridge the semantic gap between text and decision via contrastive pre-training. The aligned text embeddings follow a cyclic spectrum that exactly matches the periodicity of angular directions in a physical sense. This interesting finding shows that we effectively align text embeddings to comprehend environment dynamics and facilitate convincing language grounding in decision domains.
  • Figure 2: The overall pipeline of T2DA. (a) We encode the multi-task trajectories into dynamics-aware decision embeddings and decode the generalized world model conditioned on that embedding, effectively capturing the environment dynamics. (b) We bridge the semantic gap between decision and text by fine-tuning the text encoder (initialized from popular language models such as CLIP or T5) to align the produced text embeddings with dynamics-aware decision embeddings using contrastive loss. It distills the world model structure from decision embeddings to the text modality, aligning text embeddings to comprehend the environment dynamics. (c) We condition the generalist policy on the aligned text embeddings, and develop scalable implementations with the potential to train decision models at scale: Text-to-Decision Diffuser and Text-to-Decision Transformer. During evaluation, the agent can directly realize zero-shot text-to-decision generation according to textual instructions at hand, enabling high-capacity zero-shot generalization to downstream tasks.
  • Figure 3: Zero-shot test return curves of T2DA against baselines using Mixed datasets.
  • Figure 4: Ablation results using Mixed datasets. w/o world omits pre-training the trajectory encoder, w/o align omits contrastive pre-training, and w/o text omits the language supervision.
  • Figure 5: Results of robustness to data quality, where T2DA is compared to baselines using Expert and Medium datasets. T2DA-D and T2DA-T achieve consistent superiority across various datasets.
  • ...and 3 more figures
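
Panel (a) of Figure 2 describes encoding multi-task trajectories into a decision embedding and decoding a world model conditioned on that embedding. The sketch below shows one way such a forward pass and loss could be wired together. Everything here is hypothetical: the dimensions, the mean-pooled linear encoder, the linear decoder, and the choice of predicting next state and reward with a mean-squared error are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, Z, H = 4, 2, 8, 10  # hypothetical state, action, embedding sizes and horizon

def encode_trajectory(transitions, W_enc):
    # Mean-pool per-transition features into a single trajectory embedding.
    return np.tanh(transitions @ W_enc).mean(axis=0)

def world_model_loss(states, actions, rewards, next_states, W_enc, W_dec):
    # Encoder sees full transitions (s, a, r, s') from one task's trajectory.
    transitions = np.concatenate(
        [states, actions, rewards[:, None], next_states], axis=1)
    z = encode_trajectory(transitions, W_enc)
    # Decoder predicts (s', r) from (s, a) conditioned on the embedding z.
    dec_in = np.concatenate(
        [states, actions, np.tile(z, (len(states), 1))], axis=1)
    pred = dec_in @ W_dec
    target = np.concatenate([next_states, rewards[:, None]], axis=1)
    return np.mean((pred - target) ** 2)

# Random data standing in for an offline trajectory and untrained weights.
states = rng.normal(size=(H, S))
actions = rng.normal(size=(H, A))
rewards = rng.normal(size=H)
next_states = rng.normal(size=(H, S))
W_enc = rng.normal(size=(2 * S + A + 1, Z)) * 0.1
W_dec = rng.normal(size=(S + A + Z, S + 1)) * 0.1

loss = world_model_loss(states, actions, rewards, next_states, W_enc, W_dec)
print(float(loss))
```

The point of the design is that the decoder cannot predict task-specific transitions from the state and action alone, so a reconstruction loss of this kind pushes task-identifying information about the dynamics into the embedding. A real implementation would train both sets of weights by gradient descent and would likely use a sequence model rather than mean pooling.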