Language Instructed Reinforcement Learning for Human-AI Coordination

Hengyuan Hu; Dorsa Sadigh

TL;DR

This work tackles human-AI coordination in multi-agent RL when abundant human data is unavailable. It introduces instructRL, a framework that uses a large language model to generate a prior policy conditioned on natural language instructions and regularizes RL training toward that prior, steering the agent to human-preferred equilibria. The approach is validated in a toy Say-Select game and on the Hanabi benchmark, showing that different instructions can produce semantically distinct, human-aligned policies and that humans coordinate significantly better when they know the instruction the agent was trained with. The results suggest a scalable path to improving human-AI collaboration without large labeled human datasets, with promising directions for test-time adaptation and multi-modal instruction grounding.

Abstract

One of the fundamental quests of AI is to produce agents that coordinate well with humans. This problem is challenging, especially in domains that lack high quality human behavioral data, because multi-agent reinforcement learning (RL) often converges to different equilibria from the ones that humans prefer. We propose a novel framework, instructRL, that enables humans to specify what kind of strategies they expect from their AI partners through natural language instructions. We use pretrained large language models to generate a prior policy conditioned on the human instruction and use the prior to regularize the RL objective. This leads to the RL agent converging to equilibria that are aligned with human preferences. We show that instructRL converges to human-like policies that satisfy the given instructions in a proof-of-concept environment as well as the challenging Hanabi benchmark. Finally, we show that knowing the language instruction significantly boosts human-AI coordination performance in human evaluations in Hanabi.
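As a rough illustration of how a language-model prior could regularize action selection, the sketch below adds a weighted log-prior term to the Q-values before taking the greedy action. This additive form, the function name `regularized_greedy_action`, and the `eps` smoothing constant are assumptions made for illustration; the abstract only states that the prior regularizes the RL objective. The default weight of 0.25 mirrors the $\lambda$ quoted in the Figure 3 caption.

```python
import math


def regularized_greedy_action(q_values, prior_probs, lam=0.25, eps=1e-8):
    """Return argmax_a [Q(a) + lam * log p_prior(a)].

    Assumed form of the regularizer, not confirmed by the text above.
    q_values:    list of Q-value estimates, one per action.
    prior_probs: list of prior probabilities for the same actions,
                 e.g. derived from an LLM conditioned on the instruction.
    lam:         weight of the prior; lam=0 recovers plain greedy selection.
    eps:         small constant that keeps log() finite for zero-probability actions.
    """
    if len(q_values) != len(prior_probs):
        raise ValueError("q_values and prior_probs must have the same length")
    scores = [q + lam * math.log(p + eps) for q, p in zip(q_values, prior_probs)]
    # max() returns the first index among ties, like a plain argmax.
    return max(range(len(scores)), key=scores.__getitem__)


# Two actions are equally good according to Q; the prior breaks the tie.
q = [1.0, 1.0, 0.2]
prior = [0.1, 0.8, 0.1]
chosen = regularized_greedy_action(q, prior)          # prior favors action 1
unregularized = regularized_greedy_action(q, prior, lam=0.0)  # first maximum wins
```

When several actions have the same value, as happens across equivalent equilibria in self-play, the prior term decides which one is selected. When one action has a clearly higher Q-value, a small `lam` leaves the greedy choice unchanged.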

Paper Structure (18 sections, 2 equations, 11 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Background
Method
Experiment
Say-Select Experiment
Hanabi Experiment
Conclusions
Acknowledgments
Implementation Details and Hyper-parameters
Say-Select Experiments
Hanabi Experiments
Illustration of Hanabi
Robustness Analysis in Hanabi
Accuracy of the LLM Priors
...and 3 more sections

Figures (11)

Figure 1: Illustration of one episode of the toy example. Left: At the beginning of the episode, two random balls are assigned +1 while the others are assigned -1. Alice says '1' to Bob. Bob picks up ball #1 and the team gets +1 reward. Middle: The ball is put back on the table but is now assigned -1. Alice says '5' to Bob and Bob picks up ball #5. Right: Now all the balls have -1 reward. Alice says '5' again to Bob. Bob realizes there must be no positive-reward balls left, so he quits.
Figure 2: InstructQ. The differences between instructQ and normal Q-learning are highlighted in blue.
Figure 3: Bob's policy trained with different methods. Row values are Alice's actions two steps ago and column values are Alice's actions one step ago. The value in each cell is Bob's action when observing Alice's past two actions. Here Bob's actions are 1 through 5 (shown in different shades of blue) for selecting different balls, and "Q" (shown in yellow) refers to Bob quitting. Left and Middle: Two policies from vanilla Q-learning with different seeds. Right: Policy from instructQ with $\lambda=0.25$. We note that all three policies shown here are optimal in self-play, but only the instructQ policy is the intuitive one that follows inst = "I should select the same number as my partner".
Figure 4: Conditional action matrix $p(a_{t+1} | a_t)$ for different agents. We only show the most relevant action pairs for conciseness. The row values are the actions from the active player at time step $t$. Cr through Cw correspond to the actions of hinting color red, green, blue, yellow, and white respectively. R1 through R5 correspond to the actions of hinting rank 1 through 5. The column values are the actions from the active player at time step $t+1$. P1 through P5 correspond to playing the card at position 1 through 5, with 5 being the newest position. For each cell $p_{(i,j)}$, we first count all occurrences of the action pair over 1000 games and then normalize so that $\sum_{i,j} p_{(i,j)} = 1$. Bright yellow means high probability and dark blue means low probability. All the policies focus on playing their newest cards, but they demonstrate different hinting strategies.
Figure 5: Knowledge of cards when the agent plays those cards. Knowledge of a card is either revealed by hints or inferred from public information, such as counting the remaining cards. Only color: player knows the color but not the rank. Only rank: player knows the rank but not the color. Both: player knows exactly what the card is. None: player knows nothing about this card.
...and 6 more figures
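The counting and normalization described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `action_pair_matrix`, the representation of a game as a list of action labels in turn order, and the choice to normalize over only the displayed action pairs are all assumptions.

```python
from collections import Counter


def action_pair_matrix(games, row_actions, col_actions):
    """Count (a_t, a_{t+1}) pairs over all games and normalize the table to sum to 1.

    games:       iterable of action-label sequences, one sequence per game,
                 in turn order (hypothetical data layout).
    row_actions: labels shown as rows, e.g. hint actions such as "Cr" or "R1".
    col_actions: labels shown as columns, e.g. play actions such as "P5".
    """
    counts = Counter()
    for actions in games:
        # Pair each action with the action taken at the next time step.
        for prev, nxt in zip(actions, actions[1:]):
            if prev in row_actions and nxt in col_actions:
                counts[(prev, nxt)] += 1
    total = sum(counts.values())
    if total == 0:
        return [[0.0 for _ in col_actions] for _ in row_actions]
    # Assumption: normalize over the displayed pairs only.
    return [[counts[(r, c)] / total for c in col_actions] for r in row_actions]


# Tiny made-up example with two short games.
games = [
    ["R1", "P5", "Cr", "P5"],
    ["R1", "P5", "R1", "P4"],
]
matrix = action_pair_matrix(games, row_actions=["Cr", "R1"], col_actions=["P4", "P5"])
```

In this toy input the pair (R1, P5) occurs twice out of four counted pairs, so its cell holds half of the total mass. Pairs such as (P5, Cr) fall outside the selected rows and columns and are ignored.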
