Ground-Compose-Reinforce: Grounding Language in Agentic Behaviours using Limited Data

Andrew C. Li, Toryn Q. Klassen, Andrew Wang, Parand A. Alamdari, Sheila A. McIlraith

TL;DR

Ground-Compose-Reinforce (GCR) is a neurosymbolic RL framework that grounds high-level task specifications, given as Reward Machines (RMs), in executable behaviours using limited data, without handcrafted rewards or external oracles. The method first learns symbol groundings from a small labelled dataset, then composes them according to the RM's structure to generate self-supervised rewards for RL, aided by a novel compositional reward shaping scheme built from primitive value functions. Results in GeoGrid and DrawerWorld show that the framework elicits the specified tasks and generalizes zero-shot to unseen RM compositions, with reward shaping proving crucial for long-horizon tasks. An optional natural-language (NL) interface demonstrates zero-shot autoformalization of NL task descriptions into RM specifications, pointing toward NL-driven, data-efficient task specification for embodied agents.

Abstract

Grounding language in perception and action is a key challenge when building situated agents that can interact with humans, or other agents, via language. In the past, addressing this challenge has required manually designing the language grounding or curating massive datasets that associate language with the environment. We propose Ground-Compose-Reinforce, an end-to-end, neurosymbolic framework for training RL agents directly from high-level task specifications--without manually designed reward functions or other domain-specific oracles, and without massive datasets. These task specifications take the form of Reward Machines, automata-based representations that capture high-level task structure and are in some cases autoformalizable from natural language. Critically, we show that Reward Machines can be grounded using limited data by exploiting compositionality. Experiments in a custom Meta-World domain with only 350 labelled pretraining trajectories show that our framework faithfully elicits complex behaviours from high-level specifications--including behaviours that never appear in pretraining--while non-compositional approaches fail.
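As an illustration of what an "automata-based" task specification looks like, here is a minimal Python sketch of a Reward Machine as a data structure. It is not the authors' implementation: the class names, the `step`/`run` methods, and the "follow the first matching edge, otherwise self-loop with zero reward" rule are assumptions made for this example. The example machine is one possible reading of the lava task described in the Figure 5 caption (a reward of $-1$ for each timestep spent in lava, until the agent exits); the actual figure may use different states.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Set, Tuple

# A label is the set of propositions that are true at one timestep.
Label = FrozenSet[str]


@dataclass(frozen=True)
class Edge:
    condition: Callable[[Label], bool]  # logical formula over propositions
    reward: float                       # reward yielded when the edge is taken
    target: str                         # next RM state


@dataclass
class RewardMachine:
    initial: str
    terminal: Set[str]
    edges: Dict[str, List[Edge]]

    def step(self, state: str, label: Label) -> Tuple[str, float]:
        # Assumption: take the first edge whose condition holds;
        # otherwise stay in place with zero reward (an omitted self-loop).
        for edge in self.edges.get(state, []):
            if edge.condition(label):
                return edge.target, edge.reward
        return state, 0.0

    def run(self, labels: List[Label]) -> float:
        # Sum rewards along a trace, stopping once a terminal state is reached.
        state, total = self.initial, 0.0
        for label in labels:
            if state in self.terminal:
                break
            state, reward = self.step(state, label)
            total += reward
        return total


def in_lava(label: Label) -> bool:
    return "lava" in label


def out_of_lava(label: Label) -> bool:
    return "lava" not in label


# Hypothetical three-state reading of the lava task.
lava_rm = RewardMachine(
    initial="start",
    terminal={"done"},
    edges={
        "start": [Edge(in_lava, -1.0, "in_lava")],
        "in_lava": [
            Edge(in_lava, -1.0, "in_lava"),
            Edge(out_of_lava, 0.0, "done"),
        ],
    },
)

trace = [
    frozenset(),
    frozenset({"lava"}),
    frozenset({"lava"}),
    frozenset(),
    frozenset({"lava"}),  # ignored: the RM has already terminated
]
total_reward = lava_rm.run(trace)
```

In the framework described above, the truth values in each label would come from learned groundings rather than from an oracle; this sketch simply takes them as given.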

Paper Structure

This paper contains 38 sections, 7 equations, 8 figures, 8 tables, 5 algorithms.

Figures (8)

  • Figure 1: Ground-Compose-Reinforce, a lightweight framework for training RL agents directly from Reward Machine specifications, without oracles like reward functions or feature detectors. In pretraining, we learn to map propositional symbols to context-specific truth values ("is the robot holding the red block?") and progress signals ("how close is the robot to holding the red block?"). To elicit behaviours, we prompt the agent via a Reward Machine composed of these symbols (or via natural language, if an autoformalizer is available). The agent then synthesizes its own dense reward function and interacts with the environment to learn the desired behaviour via RL.
  • Figure 2: Four temporally extended tasks in a gridworld expressed as Reward Machines over the propositions $\mathcal{AP} = \{\mathrm{\textcolor{red}{R}}, \mathrm{\textcolor{green}{G}}, \mathrm{\textcolor{blue}{B}}, \triangle, \bigcirc \}$. An edge labelled $\langle \varphi, r \rangle$ indicates the logical condition $\varphi$ for when the corresponding transition should be followed, and the reward $r$ that is yielded as a result. Doubled circles indicate terminal states, and we omit non-rewarding self-loop edges to aid readability.
  • Figure 3: An illustration of how we estimate optimal values in an RM-MDP. Suppose the agent is currently in RM state $u^A$ (green and bolded). To evaluate the expected return for the transition $u^A \to u^B$, we estimate how close the agent is to satisfying the formula on the transition (reaching the red triangle), and bootstrap with a coarse value estimate for RM state $u^B$. The overall value of $u^A$ is approximated by the maximum expected return across all outgoing transitions from $u^A$.
  • Figure 4: DrawerWorld is a custom Meta-World environment where the agent can interact with two drawers and three boxes. Propositions capture whether: each drawer is open; each box is lifted by the agent; a given box is in a given drawer.
  • Figure 5: An RM that produces a reward of $-1$ for each timestep the agent spends in lava, until it exits the lava.
  • ...and 3 more figures
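The value estimate described in the Figure 3 caption can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the paper's formula: the caption only says that each outgoing transition is scored by combining a progress estimate for its formula with a coarse value for the next RM state, and that the maximum is taken. The multiplicative form `closeness * (reward + gamma * coarse_value)`, the function names, and the Manhattan-distance progress signal (a stand-in for a learned primitive value function) are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

GAMMA = 0.9  # assumed discount factor

Position = Tuple[int, int]
# (formula name, reward on the transition, next RM state)
Transition = Tuple[str, float, str]


def discounted_distance(goal: Position) -> Callable[[Position], float]:
    # Stand-in progress signal in (0, 1]: GAMMA ** (steps to the goal).
    # In the framework this role is played by a learned estimate.
    def progress(pos: Position) -> float:
        steps = abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])
        return GAMMA ** steps
    return progress


def estimate_value(
    rm_state: str,
    env_state: Position,
    transitions: Dict[str, List[Transition]],
    progress: Dict[str, Callable[[Position], float]],
    coarse_value: Dict[str, float],
    gamma: float = GAMMA,
) -> float:
    outgoing = transitions.get(rm_state, [])
    if not outgoing:
        return 0.0  # no outgoing transitions: nothing left to earn
    # Score each transition, then take the best one.
    return max(
        progress[formula](env_state) * (reward + gamma * coarse_value[next_u])
        for formula, reward, next_u in outgoing
    )


# Hypothetical RM fragment: from u_A the agent can head for either goal.
transitions = {
    "u_A": [("red_triangle", 0.0, "u_B"), ("green_circle", 0.0, "u_C")],
}
progress = {
    "red_triangle": discounted_distance((3, 0)),
    "green_circle": discounted_distance((0, 5)),
}
coarse_value = {"u_B": 1.0, "u_C": 0.5}

value_at_origin = estimate_value("u_A", (0, 0), transitions, progress, coarse_value)
```

Here the red-triangle transition wins because it is both closer and leads to the more valuable RM state, so the estimate for `u_A` equals that transition's score.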

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3