MIRA: Memory-Integrated Reinforcement Learning Agent with Limited LLM Guidance

Narjes Nourzad, Carlee Joe-Wong

TL;DR

This work proposes MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training, and provides theoretical analysis showing that utility-based shaping improves early-stage learning in sparse-reward environments.

Abstract

Reinforcement learning (RL) agents often suffer from high sample complexity in sparse or delayed reward settings due to limited prior structure. Large language models (LLMs) can provide subgoal decompositions, plausible trajectories, and abstract priors that facilitate early learning. However, heavy reliance on LLM supervision introduces scalability constraints and dependence on potentially unreliable signals. We propose MIRA (Memory-Integrated Reinforcement Learning Agent), which incorporates a structured, evolving memory graph to guide early training. The graph stores decision-relevant information, including trajectory segments and subgoal structures, and is constructed from both the agent's high-return experiences and LLM outputs. This design amortizes LLM queries into a persistent memory rather than requiring continuous real-time supervision. From this memory graph, we derive a utility signal that softly adjusts advantage estimation to influence policy updates without modifying the underlying reward function. As training progresses, the agent's policy gradually surpasses the initial LLM-derived priors, and the utility term decays, preserving standard convergence guarantees. We provide theoretical analysis showing that utility-based shaping improves early-stage learning in sparse-reward environments. Empirically, MIRA outperforms RL baselines and achieves returns comparable to approaches that rely on frequent LLM supervision, while requiring substantially fewer online LLM queries. Project webpage: https://narjesno.github.io/MIRA/
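The abstract describes a utility signal that adjusts advantage estimates, leaves the reward function untouched, and decays over training. A minimal Python sketch of that idea follows. The additive form, the hyperbolic schedule, and the names `hyperbolic_decay` and `shaped_advantages` are illustrative assumptions; this excerpt does not give the paper's exact formulation.

```python
from typing import List


def hyperbolic_decay(iteration: int, initial: float = 1.0, rate: float = 0.05) -> float:
    """Hypothetical schedule: the utility weight shrinks toward 0 as training proceeds."""
    return initial / (1.0 + rate * iteration)


def shaped_advantages(
    advantages: List[float], utilities: List[float], weight: float
) -> List[float]:
    """Add a weighted utility term to each advantage estimate.

    Assumed additive form: A_shaped[t] = A[t] + weight * U[t].
    Rewards are never modified; only the advantages used in the policy update change.
    """
    if len(advantages) != len(utilities):
        raise ValueError("advantages and utilities must have the same length")
    return [a + weight * u for a, u in zip(advantages, utilities)]


if __name__ == "__main__":
    # Early in a sparse-reward task the advantages may all be zero,
    # so the utility term is the only source of learning signal.
    early = shaped_advantages([0.0, 0.0, 0.0], [0.2, 0.0, 0.5], hyperbolic_decay(0))
    # Late in training the weight is small and the plain advantages dominate.
    late = shaped_advantages([1.0, -1.0], [0.5, 0.5], hyperbolic_decay(1000))
    print(early, late)
```

Under these assumptions the shaped advantages approach the unshaped ones as the weight decays, which is the informal reason the utility term would not alter the asymptotic behavior of the base algorithm.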

Paper Structure

This paper contains 57 sections, 4 theorems, 33 equations, 19 figures, 12 tables, and 5 algorithms.

Key Result

Theorem 1

Define the shaped surrogate $\mathcal{L}^{\mathrm{shaped}}(\theta) \doteq \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a_t|s_t)\, \tilde{A}_t \right]$ and the PPO surrogate $\mathcal{L}^{\mathrm{ppo}}(\pi) = \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a_t|s_t)\, A_t \right]$. Consider a training iteration [...] where $\mathcal{L}_k^{U} \doteq \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a_t|s_t)\, U_t \right]$ [...]
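The three quantities in the statement can be related by a short worked equation. The additive form $\tilde{A}_t = A_t + U_t$, with any decaying weight absorbed into $U_t$, is an assumption made here for illustration; the excerpt above is truncated and does not state it. Under that assumption, linearity of expectation gives

$$\mathcal{L}^{\mathrm{shaped}}(\theta) = \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a_t|s_t)\, (A_t + U_t) \right] = \mathcal{L}^{\mathrm{ppo}} + \mathcal{L}_k^{U}.$$

If sparse rewards make $A_t \approx 0$ early in training, then $\mathcal{L}^{\mathrm{ppo}} \approx 0$ and the shaped update is approximately $\mathcal{L}_k^{U}$, which can remain nonzero. This reading is consistent with the theorem's title, but it should not be taken as the theorem's precise statement.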

Figures (19)

  • Figure 1: Offline priors and online LLM suggestions are filtered by a screening unit before being incorporated into the memory graph as healthy grafts. The MIRA agent acts under partial observations while interacting with the environment. A utility module evaluates trajectory rollouts against the evolving memory graph, producing a utility signal that shapes advantage estimation and policy updates.
  • Figure 2: MIRA’s evolving memory graph. Trajectory segments $\tau_j$ are grouped under subgoal nodes $\kappa_\ell$. Subgoals can be shared across multiple final goals, enabling reuse of common behaviors.
  • Figure 3: Evaluation environments. Top: RedBall (navigation to target), LavaCrossing (long-horizon navigation with irreversible hazards), DoorKey (sparse reward with key–goal dependency). Bottom: RedBlueDoor (sequence-sensitive toggling), Distracted DoorKey (distractor-rich variant with key-goal dependency).
  • Figure 4: Mean return on FrozenLake (left): Both MIRA variants improve early-stage learning relative to PPO, while PPO eventually attains a comparable asymptotic return. Evolution of shaping terms $\eta_t$, $\xi_t$, and ratio $\delta_t$ (right): $\delta_t$ decays during training, ensuring convergence as $\delta_t \to 0$.
  • Figure 5: Mean return (top) and success rate (bottom) across four MiniGrid and BabyAI tasks. MIRA consistently outperforms both baselines, achieving faster learning, higher asymptotic return, and greater success rates. These results are obtained with a small LLM budget, using fewer than ten offline prompts to build memory graphs plus infrequent online queries to guide exploration.
  • ...and 14 more figures
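
The memory graph in Figure 2 groups trajectory segments under subgoal nodes and lets several final goals share a subgoal. A minimal Python sketch of one way to represent that structure follows. The class names, fields, and the (observation, action) encoding are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Segment:
    """A trajectory segment; the (observation, action) encoding is an assumption."""
    steps: List[Tuple[str, str]]
    source: str  # "agent" for high-return experience, "llm" for LLM output


@dataclass
class Subgoal:
    name: str
    segments: List[Segment] = field(default_factory=list)


@dataclass
class MemoryGraph:
    subgoals: Dict[str, Subgoal] = field(default_factory=dict)
    goals: Dict[str, List[str]] = field(default_factory=dict)  # goal -> subgoal names

    def add_segment(self, goal: str, subgoal: str, segment: Segment) -> None:
        # Reuse the subgoal node if it already exists, so goals can share it.
        node = self.subgoals.setdefault(subgoal, Subgoal(subgoal))
        node.segments.append(segment)
        linked = self.goals.setdefault(goal, [])
        if subgoal not in linked:
            linked.append(subgoal)

    def segments_for(self, goal: str) -> List[Segment]:
        # Collect every segment stored under the subgoals linked to this goal.
        return [
            seg
            for name in self.goals.get(goal, [])
            for seg in self.subgoals[name].segments
        ]


graph = MemoryGraph()
graph.add_segment("open_door", "pick_up_key", Segment([("see_key", "forward")], "llm"))
graph.add_segment("reach_goal", "pick_up_key", Segment([("at_key", "pickup")], "agent"))
graph.add_segment("reach_goal", "cross_room", Segment([("door_open", "forward")], "agent"))
```

In this sketch `pick_up_key` is a single node linked from both goals, so a segment added while pursuing one goal is also visible when querying the other. That is the reuse of common behaviors the caption describes.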

Theorems & Definitions (8)

  • Theorem 1: Non-Vanishing Updates in Sparse-Reward Regimes
  • Lemma 1: Bounded Shaped Policy Updates
  • proof
  • Theorem 2: Non-Divergence under Trust Region
  • proof
  • Theorem 3: Asymptotic Equivalence to PPO
  • proof
  • proof