Beating Atari with Natural Language Guided Reinforcement Learning
Russell Kaplan, Christopher Sauer, Alexander Sosa
TL;DR
This work introduces an instructed reinforcement learning framework that couples natural language guidance with multimodal frame–text embeddings to learn Atari games, notably Montezuma's Revenge. By awarding instruction-completion rewards and enriching the RL signal with language-derived features, the agent outperforms strong baselines (DQN, A3C) and the best agents posted to OpenAI Gym, and its learned frame–command mappings generalize across rooms. The approach addresses sparse rewards and stateful environments by combining human-provided instructions with language–vision alignment, with promising implications for real-world robotics and human–AI collaboration. The paper also provides a dataset and an architectural blueprint for learning and deploying frame–command embeddings in dynamic, multimodal settings.
Abstract
We introduce the first deep reinforcement learning agent that learns to beat Atari games with the aid of natural language instructions. The agent uses a multimodal embedding between environment observations and natural language to self-monitor progress through a list of English instructions, granting itself reward for completing instructions in addition to increasing the game score. Our agent significantly outperforms Deep Q-Networks (DQNs), Asynchronous Advantage Actor-Critic (A3C) agents, and the best agents posted to OpenAI Gym on what is often considered the hardest Atari 2600 environment: Montezuma's Revenge.
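The self-monitoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that frames and commands have already been embedded into a shared vector space and that an instruction counts as complete when the dot product between the frame embedding and the current command embedding exceeds a threshold. The names `InstructionTracker`, `shaped_reward`, `threshold`, and `bonus` are hypothetical, as are the numeric values.

```python
from dataclasses import dataclass
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


@dataclass
class InstructionTracker:
    """Hypothetical helper that walks through an ordered list of instructions."""
    command_embeddings: List[Sequence[float]]  # one vector per instruction, in order
    threshold: float = 0.5  # assumed completion threshold, not from the paper
    bonus: float = 1.0      # assumed reward per completed instruction
    current: int = 0        # index of the instruction currently being pursued

    def shaped_reward(self, frame_embedding: Sequence[float], game_reward: float) -> float:
        """Return the game reward, plus a bonus if the current instruction looks complete."""
        if self.current >= len(self.command_embeddings):
            return game_reward  # every instruction is already done
        score = dot(frame_embedding, self.command_embeddings[self.current])
        if score > self.threshold:
            self.current += 1  # advance to the next instruction
            return game_reward + self.bonus
        return game_reward


# Toy usage with 2-D embeddings standing in for learned ones.
tracker = InstructionTracker(command_embeddings=[[1.0, 0.0], [0.0, 1.0]])
r1 = tracker.shaped_reward([0.1, 0.9], game_reward=0.0)    # first command not matched
r2 = tracker.shaped_reward([0.9, 0.1], game_reward=0.0)    # first command matched
r3 = tracker.shaped_reward([0.2, 0.8], game_reward=100.0)  # second command matched
```

In this sketch the instruction bonus is simply added to the game score, so the agent still receives the environment's own reward whenever it occurs; how the two signals are actually weighted is a design choice the abstract does not specify.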
