Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals

Yue Wu; Yewen Fan; Paul Pu Liang; Amos Azaria; Yuanzhi Li; Tom M. Mitchell

TL;DR

This work tackles RL sample inefficiency in Atari by leveraging human-written instruction manuals. It introduces Read and Reward (R&R), a two-module system that uses extractive QA to summarize manuals and zero-shot reasoning to assign auxiliary rewards for detected interactions, integrated with standard RL agents. The approach yields substantial gains (e.g., ~60% improvement with 1000x fewer frames on Skiing) and generalizes across official and Wikipedia manuals in an end-to-end setting. This work demonstrates the feasibility of injecting human prior knowledge from unstructured text into RL and paves the way for broader use of manuals to accelerate learning in visually rich tasks.

Abstract

High sample complexity has long been a challenge for RL. On the other hand, humans learn to perform tasks not only from interaction or demonstrations, but also by reading unstructured text documents, e.g., instruction manuals. Instruction manuals and wiki pages are among the most abundant data that could inform agents of valuable features and policies or task-specific environmental dynamics and reward structures. Therefore, we hypothesize that the ability to utilize human-written instruction manuals to assist learning policies for specific tasks should lead to a more efficient and better-performing agent. We propose the Read and Reward framework. Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers. Our framework consists of a QA Extraction module that extracts and summarizes relevant information from the manual and a Reasoning module that evaluates object-agent interactions based on information from the manual. An auxiliary reward is then provided to a standard A2C RL agent when an interaction is detected. Experimentally, various RL algorithms obtain significant improvement in performance and training speed when assisted by our design.
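The reward-shaping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Box` type, the overlap test used as a stand-in for interaction detection, and all function names are hypothetical. Only the +5/-5 magnitude comes from the Figure 1 caption.

```python
from dataclasses import dataclass

# Magnitude taken from the Figure 1 caption ("Yes/No" mapped to +5/-5).
AUX_REWARD = 5.0


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box (hypothetical representation of a detection)."""
    x0: float
    y0: float
    x1: float
    y1: float


def boxes_overlap(a: Box, b: Box) -> bool:
    """Assumed interaction test: the two boxes share some interior area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def auxiliary_reward(answer: str) -> float:
    """Map a Yes/No answer from the Reasoning module to +5/-5, else 0."""
    normalized = answer.strip().lower()
    if normalized.startswith("yes"):
        return AUX_REWARD
    if normalized.startswith("no"):
        return -AUX_REWARD
    return 0.0  # unparseable answer: leave the reward untouched


def shaped_reward(env_reward: float, agent: Box, obj: Box, answer: str) -> float:
    """Add the auxiliary reward only when an interaction is detected."""
    if boxes_overlap(agent, obj):
        return env_reward + auxiliary_reward(answer)
    return env_reward
```

In this sketch the RL agent itself is unchanged; it simply trains on the value returned by `shaped_reward` in place of the raw environment reward.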

Paper Structure (32 sections, 7 figures, 5 tables)

Introduction
Background
Reducing sample complexity in RL
Grounding objects for control problems
Reinforcement learning informed by natural language
RL Models that Read Natural Language Instructions
Read and Reward
QA Extraction Module (Read)
Zero-shot reasoning with pre-trained QA model (Reward)
Experiments
Atari Environment and Baselines
Delayed Reward Schedule
Grounding Objects in Atari
Full end-to-end pipeline on Skiing with A2C
Ground-truth object localization/grounding experiments
...and 17 more sections

Figures (7)

Figure 1: An overview of our Read and Reward framework. Our system receives the current frame in the environment, and the instruction manual as input. After object detection and grounding, the QA Extraction Module extracts and summarizes relevant information from the manual, and the Reasoning Module assigns auxiliary rewards to detected in-game events by reasoning with outputs from the QA Extraction Module. The "Yes/No" answers are then mapped to $+5/-5$ auxiliary rewards.
Figure 2: Illustration of the QA Extraction Module on the game PacMan. We obtain generic information about the game by running extractive QA on 4 generic questions (3 shown since one question did not have an answer). We then obtain object-specific information using a question template. We concatenate generic and object-specific information to obtain the <context> string.
Figure 3: Illustration of the Reasoning Module. The <context> (from Figure 2) related to the object ghost from the QA Extraction module is concatenated with a template-generated question to form the zero-shot in-context reasoning prompt for a Large Language Model. The Yes/No answer from the LLM is then turned into an auxiliary reward for the agent.
Figure 4: Examples of SPACE (Lin et al., 2020) and CLIP (Radford et al., 2021) in the full end-to-end pipeline (Section "Full end-to-end pipeline on Skiing with A2C"). The top row shows bounding boxes for objects and the bottom row shows corresponding object masks as detected by SPACE. Most of the bounding boxes generated are correct. Left: SPACE confuses bounding boxes of agent and tree into one and the box gets classified as "tree" (blue), and the auxiliary penalty is not properly triggered. Right: The flag next to the agent (in red circle) is not detected, and therefore the auxiliary reward is not provided.
Figure 5: Table showing the number of training steps required by Agent57 with Read and Reward (Official) to reach the same performance as the Agent57 baseline at 1e6 training steps and the speedup ratio, under delayed-reward.
...and 2 more figures
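
The captions of Figures 2 and 3 describe how the <context> string and the reasoning prompt are assembled. A minimal sketch of that assembly follows, under stated assumptions: the `qa` callable stands in for an extractive QA model, and the function names and any question templates passed in are hypothetical, since the paper's actual templates are not reproduced here.

```python
from typing import Callable, Optional, Sequence

# Hypothetical interface for an extractive QA model:
# (question, manual_text) -> answer span, or None when no answer is found.
QAFn = Callable[[str, str], Optional[str]]


def build_context(
    manual: str,
    generic_questions: Sequence[str],
    object_question_template: str,
    obj: str,
    qa: QAFn,
) -> str:
    """Concatenate generic and object-specific answers into one context string.

    Questions with no answer are skipped, as in Figure 2, where only
    3 of the 4 generic questions produced an answer.
    """
    questions = list(generic_questions)
    questions.append(object_question_template.format(obj=obj))
    answers = [qa(question, manual) for question in questions]
    return " ".join(a.strip() for a in answers if a and a.strip())


def build_reasoning_prompt(context: str, obj: str, question_template: str) -> str:
    """Append a template-generated Yes/No question to the context (Figure 3)."""
    return f"{context}\n{question_template.format(obj=obj)}"
```

The resulting prompt would be sent to a large language model, whose Yes/No answer is then turned into the auxiliary reward. Keeping the QA model behind a plain callable makes it easy to swap in any extractive QA backend, or a stub for testing.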
