AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

Liang Ding

Abstract

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation. The key insight is simple: a trajectory that fails goal A is often a correct demonstration for some achievable alternative goal B. AgentHER realises this idea through a four-stage pipeline -- failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging -- that converts discarded failures into high-quality SFT, DPO, and ShareGPT training data, with both zero-cost rule-based and LLM-judge implementations. On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1 to +11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency -- matching baseline performance with only 50% of successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8 to +9.2 pp) and compound under iterative redeployment (+2.1 pp over additional rounds). Human evaluation confirms 97.7% relabeling precision under multi-judge verification.

Paper Structure

This paper contains 54 sections, 3 theorems, 2 equations, 5 figures, 9 tables, 1 algorithm.

Key Result

Proposition 3.1

Let $\pi^*_\mathcal{G}$ be the oracle goal-conditioned policy. Under a perfect judge $\mathcal{J}$ ($c(\hat{g},\tau)=1 \Leftrightarrow \tau$ is a valid demo of $\hat{g}$), every accepted pair $(\hat{g}_i, \tau_i)$ is a correct (goal, trajectory) sample from the support of $\pi^*_\mathcal{G}$.
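One way to read the proposition is sketched below. The notation is assumed for illustration and is not taken from the paper: $\mathcal{A}$ denotes the set of accepted pairs, and acceptance is taken to mean that the judge outputs $c = 1$. Under the perfect-judge condition stated above, membership in $\mathcal{A}$ then implies validity directly.

```latex
% Assumed notation: \mathcal{A} (the accepted set) is an illustrative name,
% and "accepted" is assumed to mean the judge returned c = 1.
\begin{align*}
\mathcal{A} &= \{(\hat{g}_i,\tau_i) : c(\hat{g}_i,\tau_i) = 1\}, \\
(\hat{g}_i,\tau_i) \in \mathcal{A}
  &\;\Longrightarrow\; c(\hat{g}_i,\tau_i) = 1
  \;\Longleftrightarrow\; \tau_i \text{ is a valid demo of } \hat{g}_i .
\end{align*}
```

With a graded confidence and a threshold $\theta < 1$, the same implication holds only to the extent that the judge's scores are reliable.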

Figures (5)

  • Figure 1: AgentHER vs. conventional pipelines. Standard training retains only successes ($\checkmark$), discarding 60--75% of data. AgentHER relabels failures ($\circlearrowleft$) with achievable hindsight goals, expanding the effective training corpus ${\approx}3.7\times$.
  • Figure 2: AgentHER four-stage pipeline. Badges above each box show the available implementation mode. Stage 1 classifies failures and discards irrecoverable runs (dashed downward arrow). Stage 3 retries up to three times if the relabeling confidence $c < \theta$ (dashed loop below). Stages 1--2 offer zero-cost rule-based variants; Stage 3 requires an LLM call. Section references (grey) link each stage to its detailed description.
  • Figure 3: Data efficiency and scaling (Qwen2.5-7B, WebArena). (a) AgentHER-SJ at 50% successful demos matches SFT-Success at 100%; AgentHER-MJ exceeds it. (b) Both AgentHER variants scale log-linearly with failure volume; SFT-Success cannot benefit from additional failures. (c) Both variants peak near $\theta^*{=}0.5$; MJ is uniformly better and slightly more robust at low $\theta$.
  • Figure 4: Model-size scaling on WebArena (Qwen2.5 family). Labels above AgentHER-MJ points show $\Delta$ over SFT-Success. Gains are consistent at every scale (1.5B--72B), peaking at 14B (+9.2 pp with MJ). Even the 1.5B model achieves $>$2$\times$ gain over its SFT-Success baseline.
  • Figure 5: Per-failure-type gain (Qwen2.5-7B, WebArena, AgentHER-SJ). Bar colour encodes magnitude: grey/teal = low, blue = high. Incomplete and Constraint_Violation, which together comprise $\approx$63% of WebArena failures, yield the largest gains. Tool_Error yields the least (+2.1 pp) as crashes leave minimal usable signal.
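
The confidence-gated retry loop described for Stage 3 in Figure 2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Trajectory` class, the `relabel_with_gating` function, the relabeler signature, and the `prompt`/`completion` record keys are all hypothetical names. It also assumes that "retries up to three times" means one initial attempt plus at most three retries, and it uses $\theta = 0.5$ only because Figure 3(c) reports a peak near that value.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Trajectory:
    """A failed agent run (hypothetical container, not the paper's schema)."""
    original_goal: str
    steps: List[str]  # actions/observations flattened to text


# Hypothetical relabeler: (trajectory, attempt index) -> (hindsight goal, confidence in [0, 1]).
# In practice this would wrap an LLM call; here it is just a callable.
Relabeler = Callable[[Trajectory, int], Tuple[str, float]]


def relabel_with_gating(
    traj: Trajectory,
    relabel: Relabeler,
    theta: float = 0.5,    # acceptance threshold (assumed default)
    max_retries: int = 3,  # assumed: retries after the first attempt
) -> Optional[dict]:
    """Return an SFT-style record if some attempt reaches confidence >= theta, else None."""
    for attempt in range(1 + max_retries):
        hindsight_goal, confidence = relabel(traj, attempt)
        if confidence >= theta:
            return {
                "prompt": hindsight_goal,            # relabeled goal replaces the original
                "completion": "\n".join(traj.steps),  # trajectory is kept unchanged
                "confidence": confidence,
                "attempts": attempt + 1,
            }
    return None  # every attempt fell below theta: discard the trajectory


def toy_relabeler(traj: Trajectory, attempt: int) -> Tuple[str, float]:
    """Stand-in for an LLM judge: confidence grows with each attempt (0.2, 0.4, 0.6, 0.8)."""
    goal = f"Summarise what was reached in {len(traj.steps)} steps"
    return goal, 0.2 * (attempt + 1)


failed = Trajectory(
    original_goal="Book the cheapest flight",
    steps=["search flights", "open results page", "sort by price"],
)
accepted = relabel_with_gating(failed, toy_relabeler, theta=0.5)
rejected = relabel_with_gating(failed, toy_relabeler, theta=0.95)
```

With the toy relabeler, `accepted` is produced on the third attempt, while `rejected` is `None` because no attempt reaches the stricter threshold. Returning `None` rather than a low-confidence record keeps the gate strict, which matches the caption's description of sub-threshold relabelings being retried rather than kept.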

Theorems & Definitions (7)

  • Definition 3.1: Valid Hindsight Goal
  • Proposition 3.1: Unbiasedness of AgentHER
  • Proof: Proof sketch
  • Corollary 3.1.1
  • Remark 1: Empirical sanity check of the bound
  • Theorem 1: Augmented-corpus consistency
  • Proof: Proof sketch