GRACE: A Language Model Framework for Explainable Inverse Reinforcement Learning

Silvia Sapora, Devon Hjelm, Alexander Toshev, Omar Attia, Bogdan Mazoure

TL;DR

GRACE reframes reward design for IRL by producing executable, interpretable reward programs via an LLM-guided evolutionary search. By identifying goal states from expert trajectories, evolving Python-based rewards, and actively collecting data through PPO, it achieves strong policy performance in BabyAI and AndroidWorld with minimal supervision. The code-based rewards not only offer transparency and verifiability but also naturally form modular APIs that support multi-task generalization. Empirical results show GRACE outperforms traditional IRL (e.g., GAIL) under limited demonstrations and demonstrates robust shaping and reuse capabilities, highlighting practical impact for interpretable RL in diverse domains.

Abstract

Inverse Reinforcement Learning aims to recover reward models from expert demonstrations, but traditional methods yield "black-box" models that are difficult to interpret and debug. In this work, we introduce GRACE (Generating Rewards As CodE), a method for using Large Language Models within an evolutionary search to reverse-engineer an interpretable, code-based reward function directly from expert trajectories. The resulting reward function is executable code that can be inspected and verified. We empirically validate GRACE on the BabyAI and AndroidWorld benchmarks, where it efficiently learns highly accurate rewards, even in complex, multi-task settings. Further, we demonstrate that the resulting reward leads to strong policies, compared to both competitive Imitation Learning and online RL approaches with ground-truth rewards. Finally, we show that GRACE is able to build complex reward APIs in multi-task setups.
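
To make the idea of a reward expressed as executable code concrete, the following is a minimal illustrative sketch of what such a reward program could look like. It is not taken from the paper: the `ToyState` class, its fields, the `is_goal` helper, and the shaping constants are all hypothetical and do not reflect the actual BabyAI or AndroidWorld observation formats.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToyState:
    """Hypothetical, simplified state (not the real BabyAI observation)."""
    agent_pos: Tuple[int, int]
    goal_pos: Tuple[int, int]
    carrying: Optional[str] = None


def is_goal(state: ToyState) -> bool:
    """Goal predicate: the agent stands on the goal cell while holding the key."""
    return state.agent_pos == state.goal_pos and state.carrying == "key"


def reward(state: ToyState) -> float:
    """Readable reward: 1.0 at goal states, otherwise a small shaping term."""
    if is_goal(state):
        return 1.0
    # Illustrative shaping: penalize Manhattan distance to the goal cell.
    dx = abs(state.agent_pos[0] - state.goal_pos[0])
    dy = abs(state.agent_pos[1] - state.goal_pos[1])
    shaping = -0.01 * (dx + dy)
    # Illustrative sub-goal bonus for having picked up the key.
    if state.carrying == "key":
        shaping += 0.1
    return shaping
```

Because every branch is ordinary Python, a reader can inspect the goal condition directly and check it against individual states, which is the kind of verification a neural reward model does not offer.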

Paper Structure

This paper contains 40 sections, 2 theorems, 6 equations, 8 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

Suppose $m(s) = 1$ if $s \in \mathcal{S}_g$ and $m(s) = -1$ otherwise. Then GRACE optimizes $\min_\pi \max_r J(\pi_E, m\circ r) - J(\pi, -m\circ r)$, which is a variation of the IRL objective (equation label eq:irl in the paper).
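
The quantity inside the min-max can be estimated from sampled trajectories. The sketch below is one possible reading under stated assumptions, since this summary does not define $J$ or the composition: it assumes $J(\pi, r)$ is the expected discounted return of trajectories drawn from $\pi$, and it reads $m\circ r$ as the state-wise product $m(s)\,r(s)$. The function names `make_m`, `estimate_J`, and `objective` are hypothetical.

```python
from typing import Callable, Hashable, List, Sequence, Set

State = Hashable
Trajectory = Sequence[State]


def make_m(goal_states: Set[State]) -> Callable[[State], int]:
    """m(s) = 1 if s is in the goal set S_g, else -1 (as in Proposition 1)."""
    return lambda s: 1 if s in goal_states else -1


def estimate_J(trajs: List[Trajectory],
               r: Callable[[State], float],
               gamma: float = 0.99) -> float:
    """Monte Carlo estimate of expected discounted return (assumed meaning of J)."""
    total = 0.0
    for traj in trajs:
        total += sum((gamma ** t) * r(s) for t, s in enumerate(traj))
    return total / len(trajs)


def objective(expert: List[Trajectory],
              policy: List[Trajectory],
              r: Callable[[State], float],
              m: Callable[[State], int],
              gamma: float = 0.99) -> float:
    """Literal evaluation of J(pi_E, m∘r) - J(pi, -m∘r) for a fixed r."""
    signed = lambda s: m(s) * r(s)        # assumed reading of m∘r
    neg_signed = lambda s: -m(s) * r(s)   # -m∘r
    return estimate_J(expert, signed, gamma) - estimate_J(policy, neg_signed, gamma)
```

For example, with `m = make_m({"g"})`, a reward that returns 1.0 on `"g"` and -1.0 elsewhere, one expert trajectory `["a", "g"]`, one policy trajectory `["a", "a"]`, and `gamma=1.0`, the two terms can be computed by hand and compared against `objective`.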

Figures (8)

  • Figure 1: Overview of the GRACE framework. (a) The expert, negative, and extra data are used to identify goal states. (b) The goal and non-goal states are used to generate reward functions through an evolutionary procedure. The rewards are iteratively refined by feeding back the examples that the current reward misclassifies. (c) An agent is trained with online RL using the converged reward; the data it sees during training is classified by the LLM into $\mathcal{D}^{+}$ and $\mathcal{D}^{-}$ and used to further improve the reward.
  • Figure 2: Fitness vs Number of Expert Trajectories. Fitness is computed on the test dataset after reaching maximum fitness on the training data with the corresponding number of expert and negative training trajectories. (a) Performance on all 20 BabyAI tasks. (b) Aggregate fitness across the 20 BabyAI tasks.
  • Figure 3: Fitness vs Number of Generations. Evolution of train and test fitness across evolution generations, as defined by Algorithm 1, for BabyAI (multi-level settings) and AndroidControl (bottom) for the "set alarm" task. We provide 8 expert trajectories and 8 negative trajectories for each task. Shading shows the standard deviation across 3 seeds.
  • Figure 4: Training Curves for AndroidWorld Clock Tasks. Mean episode success over the 3 AndroidWorld clock tasks: ClockStopWatchPausedVerify, ClockStopWatchRunning, and ClockTimerEntry.
  • Figure 5: Shaping. Using the default reward recovered by GRACE occasionally leads to failure in learning the correct behavior due to poor shaping. Targeted shaping in Phase 3 significantly improves both final performance and speed of learning.
  • ...and 3 more figures
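
The refinement loop described in the Figure 1(b) caption can be sketched as a small evolutionary search. This is a toy illustration under stated assumptions, not the paper's algorithm: states are integers, a candidate "reward program" is reduced to a single threshold, fitness is taken to be classification accuracy on labeled goal and non-goal states, and `propose_variants` is a hypothetical deterministic stand-in for the LLM step that rewrites a reward given its misclassified examples.

```python
from typing import List, Tuple

Example = Tuple[int, bool]  # (toy state, is_goal label)


def reward(threshold: int, state: int) -> float:
    """Toy reward program: positive iff the state reaches the threshold."""
    return 1.0 if state >= threshold else -1.0


def misclassified(threshold: int, data: List[Example]) -> List[Example]:
    """Examples where the sign of the reward disagrees with the goal label."""
    return [(s, y) for s, y in data if (reward(threshold, s) > 0) != y]


def fitness(threshold: int, data: List[Example]) -> float:
    """Assumed fitness: fraction of examples classified correctly."""
    return 1.0 - len(misclassified(threshold, data)) / len(data)


def propose_variants(threshold: int, errors: List[Example]) -> List[int]:
    """Hypothetical stand-in for the LLM: one new candidate per error."""
    # Missed goal state: lower the threshold to include it.
    # False positive: raise the threshold just past it.
    return [s if is_goal else s + 1 for s, is_goal in errors]


def evolve(data: List[Example], population: List[int],
           generations: int = 5, keep: int = 2) -> int:
    """Keep the fittest candidates and refine them with their own errors."""
    for _ in range(generations):
        # Sort by descending fitness; ties broken by threshold for determinism.
        population = sorted(set(population), key=lambda t: (-fitness(t, data), t))
        if fitness(population[0], data) == 1.0:
            return population[0]
        children: List[int] = []
        for parent in population[:keep]:
            children.extend(propose_variants(parent, misclassified(parent, data)))
        population = population[:keep] + children
    return max(sorted(set(population)), key=lambda t: fitness(t, data))
```

As a usage example, with `data = [(s, s >= 6) for s in range(10)]` and an initial population of `[0, 9]`, calling `evolve(data, [0, 9])` searches for a threshold that separates goal from non-goal states. The online RL stage of Figure 1(c), in which newly visited states are labeled and added to the data, is not modeled here.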

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • proof