Motif: Intrinsic Motivation from Artificial Intelligence Feedback

Martin Klissarov, Pierluca D'Oro, Shagun Sodhani, Roberta Raileanu, Pierre-Luc Bacon, Pascal Vincent, Amy Zhang, Mikael Henaff

TL;DR

Motif tackles the challenge of providing intrinsic motivation to agents without explicit manual task rewards by deriving a reward function from an LLM's preferences over captioned observations. The method operates offline to train an intrinsic reward from AI feedback, then uses PPO-based RL to optimize a combination of this intrinsic reward with environmental rewards. On NetHack, Motif can outperform direct score optimization and, when combined with extrinsic rewards, surpasses existing baselines, including in sparse tasks. The study also analyzes the alignment, scalability, and steerability of Motif, showing that larger LLMs and prompt design influence behavior and that prompts can steer agents toward diverse strategies, while highlighting phenomena like misalignment by composition.

Abstract

Exploring rich environments and evaluating one's actions without prior knowledge is immensely challenging. In this paper, we propose Motif, a general method to interface such prior knowledge from a Large Language Model (LLM) with an agent. Motif is based on the idea of grounding LLMs for decision-making without requiring them to interact with the environment: it elicits preferences from an LLM over pairs of captions to construct an intrinsic reward, which is then used to train agents with reinforcement learning. We evaluate Motif's performance and behavior on the challenging, open-ended and procedurally-generated NetHack game. Surprisingly, by only learning to maximize its intrinsic reward, Motif achieves a higher game score than an algorithm directly trained to maximize the score itself. When combining Motif's intrinsic reward with the environment reward, our method significantly outperforms existing approaches and makes progress on tasks where no advancements have ever been made without demonstrations. Finally, we show that Motif mostly generates intuitive human-aligned behaviors which can be steered easily through prompt modifications, while scaling well with the LLM size and the amount of information given in the prompt.
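
The step of turning pairwise preferences into a scalar reward can be illustrated with a small sketch. It assumes a Bradley-Terry style model, in which the probability that the first observation is preferred is the sigmoid of the reward difference, and a linear reward over hand-made feature vectors. Neither is confirmed by the text above, and the function names, the label encoding (1.0, 0.0, 0.5) and the hyperparameters are all hypothetical.

```python
import math


def preference_loss(r1, r2, label):
    """Cross-entropy between a preference label and a Bradley-Terry model.

    label is the target probability that the first observation is preferred
    (assumed encoding: 1.0 = first, 0.0 = second, 0.5 = no preference).
    """
    # Modeled probability that the first observation is preferred.
    p_first = 1.0 / (1.0 + math.exp(-(r1 - r2)))
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p_first + eps)
             + (1.0 - label) * math.log(1.0 - p_first + eps))


def train_linear_reward(pairs, dim, lr=0.5, epochs=200):
    """Fit r(x) = w . x by gradient descent on the preference loss.

    pairs is a list of (features_1, features_2, label) tuples.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for x1, x2, label in pairs:
            r1 = sum(wi * a for wi, a in zip(w, x1))
            r2 = sum(wi * b for wi, b in zip(w, x2))
            p_first = 1.0 / (1.0 + math.exp(-(r1 - r2)))
            # Gradient of the loss w.r.t. w is (p_first - label) * (x1 - x2).
            scale = p_first - label
            w = [wi - lr * scale * (a - b) for wi, a, b in zip(w, x1, x2)]
    return w


# Toy annotations: the annotator always prefers the observation with feature 0 set.
toy_pairs = [
    ([1.0, 0.0], [0.0, 1.0], 1.0),
    ([0.0, 1.0], [1.0, 0.0], 0.0),
]
weights = train_linear_reward(toy_pairs, dim=2)
```

After training on the toy pairs, the learned weight for the preferred feature exceeds the other one, so observations with that feature receive a higher reward. In practice the linear model would be replaced by a neural network over observations, and the labels would come from an LLM comparing captions.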

Paper Structure (29 sections, 2 equations, 28 figures, 2 tables)

Figures (28)

  • Figure 1: NetHack score for Motif and baselines. Agents trained exclusively with Motif's intrinsic reward surprisingly outperform agents trained using the score itself, and perform even better when trained with a combination of the two reward functions.
  • Figure 2: A schematic representation of the three phases of Motif. In the first phase, dataset annotation, we extract preferences from an LLM over pairs of captions, and save the corresponding pairs of observations in a dataset alongside their annotations. In the second phase, reward training, we distill the preferences into an observation-based scalar reward function. In the third phase, RL training, we train an agent interactively with RL using the reward function extracted from the preferences, possibly together with a reward signal coming from the environment.
  • Figure 3: Success rate of Motif and baselines on sparse-reward tasks. Motif is sample-efficient and makes progress where no baseline learns useful behaviors. In the baselines appendix, we additionally compare to E3B (Henaff et al., 2022) and NovelD (Zhang et al., 2021), finding no benefits over RND.
  • Figure 4: Comparison, along different axes of policy quality, of agents trained with Motif's reward function and agents trained with the environment's reward function.
  • Figure 5: Illustration of the behavior of Motif on the oracle task. The agent @ first has to survive thousands of steps, waiting to encounter F (a yellow mold), a special kind of monster that contains a hallucinogen in its body (1). Agent @ kills F (2) and then immediately eats its corpse % (3). Eating the corpse of F puts the agent into the special hallucinating status, as denoted by the Hallu indicator shown at the bottom of the screen (4). The behavior then changes: the agent seeks out a monster and remains non-aggressive, even if the monster attacks (5). If the agent survives this encounter and the hallucination period is not over, agent @ will see the monster under different appearances, for example here as a Yeti Y. Eventually, it will hallucinate the oracle @ and complete the task (6).
  • ...and 23 more figures
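
The third phase in the Figure 2 caption, where the learned reward is used "possibly together with a reward signal coming from the environment", can be sketched as a weighted sum. This is a minimal illustration under assumptions: the weighted-sum form, the function name and the coefficient names `w_int` and `w_ext` are hypothetical and not taken from the text above.

```python
def combined_reward(r_intrinsic, r_extrinsic=None, w_int=1.0, w_ext=1.0):
    """Mix a learned intrinsic reward with an optional environment reward.

    The weighted sum and both coefficients are illustrative assumptions.
    """
    if r_extrinsic is None:
        # Intrinsic-only training: the environment reward is ignored.
        return w_int * r_intrinsic
    return w_int * r_intrinsic + w_ext * r_extrinsic


# Intrinsic-only setting versus the combined setting.
intrinsic_only = combined_reward(0.5)
mixed = combined_reward(0.5, r_extrinsic=1.0, w_int=2.0, w_ext=0.5)
```

Keeping the environment term optional mirrors the two settings the captions describe: agents trained exclusively with the intrinsic reward, and agents trained with a combination of the two reward functions.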