Discovering Reinforcement Learning Interfaces with Large Language Models

Akshat Singh Jaswal, Ashish Baghel, Paras Chopra

Abstract

Reinforcement learning systems rely on environment interfaces that specify observations and reward functions, yet constructing these interfaces for new tasks often requires substantial manual effort. While recent work has automated reward design using large language models (LLMs), these approaches assume fixed observations and do not address the broader challenge of synthesizing complete task interfaces. We study RL task interface discovery from raw simulator state, where both observation mappings and reward functions must be generated. We propose LIMEN (code available at https://github.com/Lossfunk/LIMEN), an LLM-guided evolutionary framework that produces candidate interfaces as executable programs and iteratively refines them using policy training feedback. Across novel discrete gridworld tasks and continuous control domains spanning locomotion and manipulation, joint evolution of observations and rewards discovers effective interfaces given only a trajectory-level success metric, whereas optimizing either component alone fails catastrophically on at least one domain in our evaluation suite. These results demonstrate that automatic construction of RL interfaces from raw state can substantially reduce manual engineering and that observation and reward components often benefit from co-design.
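
To ground the terminology: an interface here is a pair $(\phi, R)$ of executable functions, an observation mapping and a reward function, emitted by the LLM as code. The sketch below shows what such a pair might look like for a simple goal-reaching task; the function names and raw-state fields (`agent_pos`, `goal_pos`, `success`) are illustrative assumptions, not the paper's actual generated programs.

```python
# Hypothetical candidate interface: an observation mapping phi and a reward
# function, both operating on the raw simulator state. Field names are
# assumptions for illustration only.
import numpy as np

def phi(raw_state: dict) -> np.ndarray:
    """Observation mapping: select and transform features of the raw state."""
    agent = np.asarray(raw_state["agent_pos"], dtype=np.float32)
    goal = np.asarray(raw_state["goal_pos"], dtype=np.float32)
    # Expose the relative goal direction rather than absolute coordinates.
    return np.concatenate([agent, goal - agent])

def reward(raw_state: dict, action, next_raw_state: dict) -> float:
    """Reward: dense progress toward the goal plus a terminal success bonus."""
    before = np.linalg.norm(np.asarray(raw_state["agent_pos"], dtype=np.float32)
                            - np.asarray(raw_state["goal_pos"], dtype=np.float32))
    after = np.linalg.norm(np.asarray(next_raw_state["agent_pos"], dtype=np.float32)
                           - np.asarray(next_raw_state["goal_pos"], dtype=np.float32))
    bonus = 10.0 if next_raw_state.get("success", False) else 0.0
    return float(before - after) + bonus
```

Returning only the bonus term would correspond to the sparse success signal; the dense shaping term is one example of the kind of structure an evolutionary search over reward code could discover.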

Figures (7)

  • Figure 1: Overview of the LIMEN framework. The outer loop performs evolutionary search: LIMEN selects a parent interface from the MAP-Elites archive, mutates it via LLM-guided code generation, and evaluates the resulting interface by training an RL agent in the inner loop. The interface $(\phi, R)$ mediates between the raw simulator state and the agent, defining the observations and rewards that constitute the induced MDP. Fitness is measured by trajectory-level task success and fed back to update the archive. A minimal pseudocode sketch of this loop follows the figure list.
  • Figure 2: Evaluation environments. Top: XLand-MiniGrid tasks of increasing compositional complexity, (a) object pickup among distractors, (b) relational placement, (c) multi-step rule chain across rooms. Bottom: MuJoCo tasks, (d) quadruped push recovery, (e) manipulator trajectory tracking.
  • Figure 3: Evolution progress of LIMEN showing candidate interfaces, crash events, and improvements in the running best success rate across iterations.
  • Figure 4: Learning curves for LIMEN and ablations across five tasks. Success rate versus environment steps (millions), averaged over 10 seeds with shaded standard deviation. Joint interface discovery consistently achieves higher performance than observation-only, reward-only, and sparse baselines.
  • Figure 5: Independent LLM samples (no evolution) versus the best interface found by LIMEN across four tasks. Each dot is a single interface sampled from the LLM with the same prompt and evaluated over 3 seeds with identical training budgets.
  • ...and 2 more figures
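
The control flow summarized in Figure 1's caption can be fixed with a compact sketch. Everything below is a hypothetical rendering: `llm_mutate`, `train_and_score`, and `behavior_descriptor` are assumed callables standing in for the LLM mutation step, the inner-loop policy training, and the archive's behavior descriptor. The real LIMEN code (linked in the abstract) may structure selection, archiving, and crash handling differently.

```python
# Sketch of a MAP-Elites-style outer loop with LLM-guided mutation, after the
# description in Figure 1. The three callables are stand-ins, not LIMEN's API.
import random

def evolve_interfaces(seed_interfaces, iterations,
                      llm_mutate, train_and_score, behavior_descriptor):
    # MAP-Elites archive: behavior-descriptor cell -> (interface, fitness)
    archive = {}

    def try_insert(interface):
        try:
            fitness = train_and_score(interface)  # inner loop: train a policy and
                                                  # score trajectory-level success
            cell = behavior_descriptor(interface)
        except Exception:
            return  # crashing candidates (cf. Figure 3) are simply discarded
        if cell not in archive or fitness > archive[cell][1]:
            archive[cell] = (interface, fitness)  # keep the elite per cell

    for interface in seed_interfaces:   # seed the archive; assumes at least
        try_insert(interface)           # one seed trains without crashing

    for _ in range(iterations):         # outer evolutionary loop
        parent, _ = random.choice(list(archive.values()))  # select a parent elite
        try_insert(llm_mutate(parent))                     # LLM-guided code mutation

    return max(archive.values(), key=lambda e: e[1])       # running best interface
```

The per-cell elitism is what lets the search keep diverse interface designs alive rather than collapsing onto the first high-fitness candidate, which is the usual motivation for a MAP-Elites archive over a plain hill climber.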