Instruction Following with Goal-Conditioned Reinforcement Learning in Virtual Environments

Zoya Volovikova, Alexey Skrynnik, Petr Kuderov, Aleksandr I. Panov

TL;DR

IGOR addresses instruction following in virtual environments by decoupling language understanding from action execution. It uses a Language Module to translate natural language into subtasks, a Task Manager to track progress and issue subtask rewards, and a PPO-based Policy Module to execute actions within a POMDP framework, with a dedicated curriculum to guide learning. On IGLU and Crafter, IGOR outperforms strong baselines (including the IGLU 2022 winners and Dynalang) and benefits from data augmentation, primitive subtask representations, and curriculum learning. This modular separation allows environment-specific techniques to be integrated flexibly and shows promise for scalable instruction following in multimodal embodied AI applications that require complex task execution in virtual settings. Expectation-based ($\mathbb{E}$-style) reinforcement learning objectives and curated curricula underpin policy optimization, contributing to improved generalization and sample efficiency.
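
For reference, the summary says the Policy Module is PPO-based. The standard PPO clipped surrogate objective is shown below, written for a policy that conditions on a subtask encoding. This is the textbook form of PPO, not necessarily the equation used in the paper. The symbols $o_t$ (partial observation), $g$ (subtask encoding supplied by the Task Manager), $\hat{A}_t$ (advantage estimate), and $\epsilon$ (clipping range) are notation assumed here for illustration.

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid o_t, g)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid o_t, g)}
$$

Under this reading, the subtask rewards issued by the Task Manager enter through $\hat{A}_t$, and the clipping term limits how far a single update can move the policy away from $\pi_{\theta_{\mathrm{old}}}$.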

Abstract

In this study, we address the issue of enabling an artificial intelligence agent to execute complex language instructions within virtual environments. In our framework, we assume that these instructions involve intricate linguistic structures and multiple interdependent tasks that must be navigated successfully to achieve the desired outcomes. To effectively manage these complexities, we propose a hierarchical framework that combines the deep language comprehension of large language models with the adaptive action-execution capabilities of reinforcement learning agents. The language module (based on LLM) translates the language instruction into a high-level action plan, which is then executed by a pre-trained reinforcement learning agent. We have demonstrated the effectiveness of our approach in two different environments: in IGLU, where agents are instructed to build structures, and in Crafter, where agents perform tasks and interact with objects in the surrounding environment according to language commands.
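
The hierarchy described above (instruction to subtask plan, plan to goal-conditioned actions) can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the names `Subtask`, `language_module`, `TaskManager`, and `run_episode` are hypothetical, the "language module" is a trivial string splitter standing in for an LLM, and the policy and environment are stubs rather than a trained PPO agent in IGLU or Crafter.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Subtask:
    """One step of the high-level plan (hypothetical representation)."""
    name: str


def language_module(instruction: str) -> List[Subtask]:
    """Stand-in for the LLM: split an instruction on the word 'then'."""
    parts = [p.strip() for p in instruction.lower().split(" then ")]
    return [Subtask(p.replace(" ", "_")) for p in parts if p]


@dataclass
class TaskManager:
    """Tracks which subtask is active and rewards its completion."""
    subtasks: List[Subtask]
    index: int = 0

    @property
    def done(self) -> bool:
        return self.index >= len(self.subtasks)

    def current(self) -> Optional[Subtask]:
        return None if self.done else self.subtasks[self.index]

    def step(self, achieved: Set[str]) -> float:
        """Return 1.0 and advance if the active subtask was achieved, else 0.0."""
        active = self.current()
        if active is not None and active.name in achieved:
            self.index += 1
            return 1.0
        return 0.0


def run_episode(
    instruction: str,
    policy: Callable[[object, Subtask], str],
    env_step: Callable[[str], Tuple[object, Set[str]]],
    max_steps: int = 50,
) -> Tuple[float, bool]:
    """Run the goal-conditioned loop; return (total subtask reward, plan finished)."""
    manager = TaskManager(language_module(instruction))
    total, obs = 0.0, None
    for _ in range(max_steps):
        if manager.done:
            break
        action = policy(obs, manager.current())   # policy sees observation + subtask
        obs, achieved = env_step(action)          # environment reports achieved subtasks
        total += manager.step(achieved)
    return total, manager.done


# Toy usage: the stub policy "acts" by naming the subtask, and the stub
# environment reports that subtask as achieved.
result = run_episode(
    "Collect wood then place table",
    policy=lambda obs, subtask: subtask.name,
    env_step=lambda action: (None, {action}),
)
print(result)  # (2.0, True)
```

Keeping the plan tracking and reward logic in a separate `TaskManager` mirrors the modular split the abstract describes: the planner or the policy can be swapped without changing the other.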

Paper Structure (31 sections, 1 equation, 11 figures, 6 tables, 1 algorithm)

Figures (11)

  • Figure 1: Collaborative interaction between the agent, the environment, and the user: the user provides instructions, and the agent executes actions in the environment to accomplish the task those instructions describe.
  • Figure 2: The IGOR framework has three modules: a Language module that solves language understanding problems and provides a high-level plan of subtasks, a Task Manager that encodes the subtasks for the Policy module, and a Policy module that executes actions in the environment based on visual observations and subtasks.
  • Figure 3: The diagram displays the IGOR system, where the "Language Module" transforms text instructions into subtasks. The "Task Manager" coordinates the subtasks and monitors their execution. The "Policy Module" operates in a virtual environment based on the subtasks. Dotted lines indicate the training process of the modules, while solid lines show how the modules interact during inference.
  • Figure 4: IGLU is a 3D environment where agents are tasked with constructing structures in a designated area, guided by natural-language descriptions and the agent’s first-person perspective.
  • Figure 5: Crafter is a 2D environment reminiscent of Minecraft, where players must gather food and water, acquire resources, fend off creatures, and construct tools.
  • ...and 6 more figures