Game On: Towards Language Models as RL Experimenters

Jingwei Zhang; Thomas Lampe; Abbas Abdolmaleki; Jost Tobias Springenberg; Martin Riedmiller

TL;DR

The paper addresses the challenge of automating reinforcement learning experiment workflows for embodied agents by introducing an architecture driven by a Vision-Language Model (VLM) that handles task proposal, task decomposition, and analysis of training progress. It implements a prototype that uses a Gemini model zero-shot, with a fixed library of low-level skills, and trains a text-conditioned offline Perceiver-Actor-Critic (PAC) policy, using VLM-guided data collection to improve learning and expand the skill repertoire. On a robotic block-stacking benchmark, the approach yields more diverse data, enables self-improvement through additional skills, and produces progressively more complex task decompositions as the skill library grows. Limitations include the absence of automated reward modeling and of fully automatic stopping; future work focuses on integrating LLM-based rewards, automatic skill addition, dynamic skill durations, and end-to-end automation across the RL loop.

Abstract

We propose an agent architecture that automates parts of the common reinforcement learning experiment workflow, to enable automated mastery of control domains for embodied agents. To do so, it leverages a vision-language model (VLM) to perform some of the capabilities normally required of a human experimenter, including monitoring and analyzing experiment progress, proposing new tasks based on past successes and failures of the agent, decomposing tasks into a sequence of subtasks (skills), and retrieving the skill to execute, which enables our system to build automated curricula for learning. We believe this is one of the first proposals for a system that leverages a VLM throughout the full experiment cycle of reinforcement learning. We provide a first prototype of this system, and examine the feasibility of current models and techniques for the desired level of automation. For this, we use a standard Gemini model, without additional fine-tuning, to provide a curriculum of skills to a language-conditioned Actor-Critic algorithm, in order to steer data collection so as to aid learning new skills. Data collected in this way is shown to be useful for learning and iteratively improving control policies in a robotics domain. Additional examination of the ability of the system to build a growing library of skills, and to judge the progress of the training of those skills, also shows promising results, suggesting that the proposed architecture provides a potential recipe for fully automated mastery of tasks and domains for embodied agents.
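The curriculum workflow described above (propose a task, decompose it into steps, retrieve a library skill for each step) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `build_plan`, the `Plan` type, the stub functions standing in for VLM calls, and the skill names are all hypothetical, chosen only to show the control flow. The sketch assumes, as the paper's Figure 1 caption states for the current prototype, that a plan is discarded whenever any step cannot be mapped onto an existing skill.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Plan:
    proposition: str   # free-text task proposed by the curriculum
    steps: List[str]   # free-text decomposition of the proposition
    skills: List[str]  # fixed-text skills retrieved from the library


def build_plan(
    propose: Callable[[List[str]], str],
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str, List[str]], Optional[str]],
    skill_library: List[str],
) -> Optional[Plan]:
    """Propose a task, decompose it, and map each step onto a library skill.

    Returns None (plan discarded) if any step has no matching skill.
    """
    proposition = propose(skill_library)
    steps = decompose(proposition)
    skills = []
    for step in steps:
        skill = retrieve(step, skill_library)
        if skill is None or skill not in skill_library:
            return None  # retrieval failed: discard the whole plan
        skills.append(skill)
    return Plan(proposition, steps, skills)


# Hypothetical stubs standing in for VLM calls (assumptions, not real prompts).
library = ["reach red", "lift red", "place red on blue"]


def stub_propose(lib: List[str]) -> str:
    return "stack the red block on the blue block"


def stub_decompose(proposition: str) -> List[str]:
    return ["move to the red block", "pick up the red block",
            "put it on the blue block"]


def stub_retrieve(step: str, lib: List[str]) -> Optional[str]:
    # Toy keyword lookup in place of VLM-based retrieval.
    keywords = {"move": "reach red", "pick": "lift red",
                "put": "place red on blue"}
    for key, skill in keywords.items():
        if step.startswith(key) and skill in lib:
            return skill
    return None


plan = build_plan(stub_propose, stub_decompose, stub_retrieve, library)
failed = build_plan(stub_propose, lambda p: ["juggle the blocks"],
                    stub_retrieve, library)
```

Here `plan` holds the three retrieved skills in order, while `failed` is `None` because its single step has no counterpart in the library. In a real system the three callables would wrap prompts to the VLM, and the returned skill sequence would be handed to the text-conditioned policy for execution.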

Paper Structure (39 sections, 5 figures, 2 tables)

Introduction
Related Work
LLM-based Virtual Agents
LLM/VLM-based Embodied Agents
System Architecture
The Curriculum Module
Task proposition.
Task decomposition.
Skill retrieval.
The Embodiment Module
The Analysis Module
System Realization
Module Interaction
Policy Training
Prompting
...and 24 more sections

Figures (5)

Figure 1: Illustration of the system architecture and the interaction of its components. The curriculum module generates free-text propositions, decomposes them into free-text steps, and tries to map those onto fixed-text skills from a library. In the current implementation, if the retrieval step (mapping free-text steps onto fixed-text skills) fails, the plan is discarded, because we do not have access to a reward model for generating arbitrary reward signals from skill captions; once such a reward model becomes available, a failed retrieval should instead signal the training of a new skill. The generated skill sequence is executed by a text-conditioned policy of the embodiment module and unrolled into an episode, which is used to improve the policy. Performance during policy training is evaluated by the analysis module, which judges whether training has converged and skills should be added to the library.
Figure 2: Screenshot of a Google Meet session hosting the agent when performing a multi-robot simulation experiment, with boxes annotating the different modules. Note that the curriculum module mirrors the image stream of one of the embodiment modules it currently attends to.
Figure 3: Training curves for PAC self-improvement. Adding the self-improvement set (red) consistently outperforms using only the pretraining set (green), both on the average over all tasks (top left) and on each individual task family. Adding a third set of data, collected with the newly added pyramid skills, produces even better performance (blue).
Figure 4: Example evaluation curves provided to the VLM, at different numbers of learning steps during training. Color coding denotes whether a curve was judged as converged (green) or not yet converged (red).
Figure 5: Cumulative percentage of curves judged as converged by the analysis module, per task family.
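The quantity plotted in Figure 5 can be illustrated with a small Python sketch. It rests on an assumption about how "cumulative" is defined: for each task family, the percentage of curves whose first "converged" judgment occurred at or before a given number of learning steps. The function name `cumulative_converged`, the family names, and the step values are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def cumulative_converged(
    first_converged: List[Tuple[str, Optional[int]]],
    checkpoints: List[int],
) -> Dict[str, List[float]]:
    """Percentage of curves per family judged converged by each checkpoint.

    first_converged holds one (family, step) pair per curve, where step is
    the learning step of the first "converged" judgment, or None if the
    curve was never judged converged.
    """
    by_family: Dict[str, List[Optional[int]]] = defaultdict(list)
    for family, step in first_converged:
        by_family[family].append(step)

    result: Dict[str, List[float]] = {}
    for family, steps in by_family.items():
        result[family] = [
            100.0 * sum(1 for s in steps if s is not None and s <= c) / len(steps)
            for c in checkpoints
        ]
    return result


# Made-up judgments for two hypothetical task families.
judgments = [("stack", 1000), ("stack", 3000), ("lift", 1000), ("lift", None)]
table = cumulative_converged(judgments, [1000, 2000, 3000])
```

With these made-up inputs, the "stack" family reaches 100% at the last checkpoint, while the "lift" family stays at 50% because one of its curves is never judged converged.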
