Policy Learning with a Language Bottleneck

Megha Srivastava, Cedric Colas, Dorsa Sadigh, Jacob Andreas

TL;DR

PLLB introduces a language bottleneck that alternates between generating linguistic rules from contrasting high- and low-reward experiences and updating policies to align with those rules. This approach yields more interpretable, generalizable, and human-interoperable behaviors across five diverse tasks, including signaling games, maze navigation, collaborative image reconstruction, and robotic grasping, with open-source code available. By enabling agents to generate and share rules, PLLB facilitates better human-AI coordination and cultural transmission of problem-solving strategies, while still leveraging non-linguistic learning signals. The results demonstrate that language-informed policy learning can enhance both cognitive performance and communicative usefulness beyond traditional RL baselines.

Abstract

Modern AI systems such as self-driving cars and game-playing agents achieve superhuman performance, but often lack human-like generalization, interpretability, and interoperability with human users. Inspired by the rich interactions between language and decision-making in humans, we introduce Policy Learning with a Language Bottleneck (PLLB), a framework enabling AI agents to generate linguistic rules that capture the high-level strategies underlying rewarding behaviors. PLLB alternates between a *rule generation* step guided by language models, and an *update* step where agents learn new policies guided by rules, even when a rule is insufficient to describe an entire complex policy. Across five diverse tasks, including a two-player signaling game, maze navigation, image reconstruction, and robot grasp planning, we show that PLLB agents are not only able to learn more interpretable and generalizable behaviors, but can also share the learned rules with human users, enabling more effective human-AI coordination. We provide source code for our experiments at https://github.com/meghabyte/bottleneck.

Paper Structure

This paper contains 36 sections, 1 equation, 15 figures, and 5 tables.

Figures (15)

  • Figure 1: Policy Learning with a Language Bottleneck (PLLB) alternates between two steps: 1) gen_rule (bottom) generates a linguistic rule $\mathcal{L}_i$ explaining the agent's best behaviors by prompting a language model with contrastive (positive and negative) episodes (middle); 2) update (top) learns a new policy conditioned on that rule. PLLB facilitates stronger human-AI coordination by constraining policies to be more interpretable and generalize better.
  • Figure 2: PLLB can be applied to a diverse set of domains. In all domains, it iterates between gen_rule, a function that prompts an LM to extract a linguistic rule ($\mathcal{L}$, blue) by contrasting high- and low-reward episodes from past experience (green and red boxes), and update, a function for updating an agent's policy through interaction with the environment conditioned on $\mathcal{L}$. PLLB can be applied to multi-step decision-making tasks and to visual or robotics tasks alike with minimal implementation variations (replacing the LM with a VLM in gen_rule, or replacing policy regularization with simple instruction conditioning in update). Exact prompts are in Appendix Section \ref{app:prompts}.
  • Figure 3: Overview of the SaySelect game, reproduced from the InstructRL paper (hu2023instructrl).
  • Figure 4: SaySelect. a) Bottleneck agents learn as fast as TabularQ and InstructRL, and faster than agents using adversarial rules (Adversarial). b) Unlike TabularQ, they learn human-interpretable policies, and they do so without relying on an external instruction as InstructRL does. c) When faced with speakers enforcing a non-human-interpretable policy, Bottleneck converges faster than InstructRL.
  • Figure 5: Results in Maze. a) Bottleneck agents learn as fast as the non-linguistic Baseline agents, and faster than Adversarial and LinearQ agents. b) When faced with a new maze with similar structure, Bottleneck agents learn faster than TabularQ (which does not perceive color) and LinearQ (which does, but learns more slowly). c) When faced with a maze of a different structure, Bottleneck agents adapt swiftly while LinearQ cannot recover. We do not evaluate Adversarial in the Generalization or Adaptation experiments due to its poor performance in the standard setting.
  • ...and 10 more figures
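
The alternation described in the captions of Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-step toy task, the `toy_lm` stand-in for a language model, the `rule_bonus` conditioning term, and all hyperparameters are hypothetical. Only the overall shape (select contrastive episodes, ask a model for a rule, update the policy conditioned on that rule) follows the description above.

```python
import random
import re
from typing import Callable, Dict, List, Tuple

Episode = Tuple[int, float]  # (action, reward) in a one-step toy task (assumption)


def select_contrastive(episodes: List[Episode], k: int) -> Tuple[List[Episode], List[Episode]]:
    """Return the k highest-reward and the k lowest-reward episodes."""
    ranked = sorted(episodes, key=lambda ep: ep[1], reverse=True)
    return ranked[:k], ranked[-k:]


def format_episodes(episodes: List[Episode]) -> str:
    """Render episodes as one text line each for the prompt."""
    return "\n".join(f"action={a} reward={r}" for a, r in episodes)


def gen_rule(episodes: List[Episode], lm: Callable[[str], str], k: int = 2) -> str:
    """Build a contrastive prompt and ask the (stand-in) language model for a rule."""
    positive, negative = select_contrastive(episodes, k)
    prompt = (
        f"High-reward episodes:\n{format_episodes(positive)}\n"
        f"Low-reward episodes:\n{format_episodes(negative)}\n"
        "State a rule that explains the high-reward behavior."
    )
    return lm(prompt)


def toy_lm(prompt: str) -> str:
    """Hypothetical stand-in for an LM: names the action of the first (best) listed episode."""
    match = re.search(r"action=(\d+)", prompt)
    return f"prefer action {match.group(1)}" if match else "no rule"


def rule_bonus(rule: str, action: int) -> float:
    """Hypothetical rule-conditioning term: 1.0 if the rule names this action, else 0.0."""
    return 1.0 if rule == f"prefer action {action}" else 0.0


def update(q: Dict[int, float], rule: str, reward_fn, rng: random.Random,
           n_episodes: int = 50, eps: float = 0.2, lr: float = 0.5,
           weight: float = 0.5) -> Tuple[Dict[int, float], List[Episode]]:
    """Act epsilon-greedily on Q-values biased toward the rule, and update Q from rewards."""
    episodes: List[Episode] = []
    for _ in range(n_episodes):
        if rng.random() < eps:
            action = rng.choice(list(q))  # explore uniformly
        else:
            action = max(q, key=lambda a: q[a] + weight * rule_bonus(rule, a))
        reward = reward_fn(action, rng)
        q[action] += lr * (reward - q[action])  # running estimate of the action's reward
        episodes.append((action, reward))
    return q, episodes


def pllb(reward_fn, n_actions: int = 3, n_iters: int = 3, seed: int = 0):
    """Alternate between update (collect experience) and gen_rule (summarize it)."""
    rng = random.Random(seed)
    q = {a: 0.0 for a in range(n_actions)}
    rule = ""  # no rule exists before the first round of experience
    for _ in range(n_iters):
        q, episodes = update(q, rule, reward_fn, rng)
        rule = gen_rule(episodes, toy_lm)
    return q, rule
```

A usage example, again with a made-up reward function in which action 2 is the rewarding one:

```python
def noisy_reward(action: int, rng: random.Random) -> float:
    return (1.0 if action == 2 else 0.0) + rng.gauss(0.0, 0.1)

q_values, rule = pllb(noisy_reward)
print(rule)
```

In this sketch the rule acts as a soft bias on action selection rather than a hard constraint, which loosely mirrors the idea that the policy still learns from reward when a rule does not describe the whole policy. A real system would replace `toy_lm` with a call to an actual language or vision-language model.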