Beating Atari with Natural Language Guided Reinforcement Learning

Russell Kaplan, Christopher Sauer, Alexander Sosa

TL;DR

This work introduces an Instructed Reinforcement Learning framework that couples natural language guidance with multimodal frame–text embeddings to learn Atari games, notably Montezuma's Revenge. By rewarding the agent for completing instructions and feeding language-derived features into the policy and value networks, the agent surpasses strong baselines (DQN, A3C) and the best agents posted to OpenAI Gym, while demonstrating generalization of learned frame–command mappings across rooms. The approach addresses sparse rewards and stateful environments by leveraging human-provided instructions and language–vision alignment, with promising implications for real-world robotics and human–AI collaboration. The paper also provides a dataset and architectural blueprint for learning and deploying frame–command embeddings in dynamic, multimodal settings.

Abstract

We introduce the first deep reinforcement learning agent that learns to beat Atari games with the aid of natural language instructions. The agent uses a multimodal embedding between environment observations and natural language to self-monitor progress through a list of English instructions, granting itself reward for completing instructions in addition to increasing the game score. Our agent significantly outperforms Deep Q-Networks (DQNs), Asynchronous Advantage Actor-Critic (A3C) agents, and the best agents posted to OpenAI Gym on what is often considered the hardest Atari 2600 environment: Montezuma's Revenge.
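
The self-monitoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product score, the `threshold` and `bonus` values, and the names `InstructionTracker`, `embed_text`, and `shaped_reward` are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as an assumed similarity score."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class InstructionTracker:
    """Walks through a list of instructions and adds a bonus on completion."""
    instructions: List[str]
    embed_text: Callable[[str], Vector]  # hypothetical sentence encoder
    threshold: float = 0.5               # assumed completion threshold
    bonus: float = 1.0                   # assumed instruction reward
    index: int = 0                       # position of the current instruction

    def shaped_reward(self, frame_embedding: Vector, game_reward: float) -> float:
        """Return the game reward, plus a bonus if the current instruction is satisfied."""
        if self.index >= len(self.instructions):
            return game_reward  # every instruction is already completed
        current = self.embed_text(self.instructions[self.index])
        if dot(frame_embedding, current) > self.threshold:
            self.index += 1     # move on to the next instruction
            return game_reward + self.bonus
        return game_reward
```

In this sketch the game score is never replaced; the instruction bonus is added on top of it, matching the abstract's description of reward "in addition to increasing the game score".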

Paper Structure

This paper contains 17 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Left: An agent exploring the first room of Montezuma's Revenge. Right: An example of the list of natural language instructions one might give the agent. The agent grants itself an additional reward after completing the current instruction. "Completion" is learned by training a generalized multimodal embedding between game images and text.
  • Figure 2: Deep Q-Network performance on all Atari 2600 games, normalized against a human expert. The bottom-most game is Montezuma's Revenge. After playing it for 200 hours, DQN does no better than a random agent and scores no points [DQNNature].
  • Figure 3: Reinforcement learning cycle.
  • Figure 4: The overall architecture of our natural language instructed agent at reinforcement learning time, described as the second step above. The agent's input state at a given frame is shown on the left. It consists of four recent frames (the last two frames and the 5th and 9th prior frames) and the current natural language instruction. As in a standard deep reinforcement learning agent, the state is run through a convolutional neural network and then fully connected policy and value networks (shown in blue) to produce an action and update. The multimodal embedding between frame pairs and instructions, trained in the first step above and shown in green, is used to determine if a natural language instruction has been satisfied by the past two frames. Satisfying an instruction moves the agent on to the next and leads to the agent giving itself a small additional reward. The frame and instruction sentence embedding are also passed as additional features to the network learning the policy and value. Intuitively, this equates to telling the agent (1) what is next expected of it, rather than leaving it to have to explore blindly for the next reward, and (2) how its progress is being measured against that command. Together these allow it to better generalize which actions are required to satisfy a given command.
  • Figure 5: Injecting additional reward in Breakout for keeping the paddle under the ball speeds initial learning greatly (left); however, once the agent masters the instructions, which describe only basic game play, the instructions cease to speed up learning (right). Having mastered Breakout, a more difficult environment (Montezuma's Revenge) is required for further reinforcement learning insight.
  • ...and 1 more figure
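
The frame selection described in the Figure 4 caption (the last two frames plus the 5th and 9th prior frames) could be sketched as follows. This is a hedged illustration, not the paper's code: the offsets `(0, 1, 5, 9)` are one reading of the caption, and the class name `FrameHistory`, its methods, and the choice to repeat the oldest available frame early in an episode are assumptions.

```python
from collections import deque

# Assumed reading of the caption: current frame, previous frame,
# and the frames 5 and 9 steps before the current one.
FRAME_OFFSETS = (0, 1, 5, 9)


class FrameHistory:
    """Keeps just enough recent frames to build the four-frame state."""

    def __init__(self, offsets=FRAME_OFFSETS):
        self.offsets = offsets
        # Only the newest max(offsets) + 1 frames are ever needed.
        self.buffer = deque(maxlen=max(offsets) + 1)

    def push(self, frame):
        """Record the newest frame, discarding the oldest when full."""
        self.buffer.append(frame)

    def state_frames(self):
        """Return the frames at the configured offsets, newest first."""
        if not self.buffer:
            raise ValueError("no frames recorded yet")
        newest = len(self.buffer) - 1
        # Early in an episode, fall back to the oldest frame (assumption).
        return [self.buffer[max(newest - off, 0)] for off in self.offsets]
```

Sampling older frames at wider spacing gives the network a longer view of recent motion than four consecutive frames would, at the same input size.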