Just-in-time Episodic Feedback Hinter: Leveraging Offline Knowledge to Improve LLM Agents Adaptation

Hadi Nekoei, Aman Jaiswal, Patrice Bechard, Oleh Shliazhko, Orlando Marquez Ayala, Mathieu Reymond, Massimo Caccia, Alexandre Drouin, Sarath Chandar, Alexandre Lacoste

TL;DR

The paper tackles the challenge of enhancing LLM agents in unfamiliar domains without costly online interactions or fine-tuning by distilling offline trajectories into lightweight, context-aware hints. It introduces Just-in-time Episodic Feedback Hinter (JEF Hinter), which uses a zooming module to identify decisive decision points and a reflection step to generate concise hints, capable of leveraging both successful and failed traces. At inference, a retriever selects relevant hints to condition the agent's actions, enabling targeted guidance with transparency and no additional training. Empirical results on MiniWoB++, WorkArena-L1, and WebArena-Lite show that JEF Hinter outperforms strong baselines, including document- and human-based hints, and demonstrates robust generalization to unseen tasks and goals.

Abstract

Large language model (LLM) agents perform well in sequential decision-making tasks, but improving them on unfamiliar domains often requires costly online interactions or fine-tuning on large expert datasets. These strategies are impractical for closed-source models and expensive for open-source ones, with risks of catastrophic forgetting. Offline trajectories offer reusable knowledge, yet demonstration-based methods struggle because raw traces are long, noisy, and tied to specific tasks. We present Just-in-time Episodic Feedback Hinter (JEF Hinter), an agentic system that distills offline traces into compact, context-aware hints. A zooming mechanism highlights decisive steps in long trajectories, capturing both strategies and pitfalls. Unlike prior methods, JEF Hinter leverages both successful and failed trajectories, extracting guidance even when only failure data is available, while supporting parallelized hint generation and benchmark-independent prompting. At inference, a retriever selects relevant hints for the current state, providing targeted guidance with transparency and traceability. Experiments on MiniWoB++, WorkArena-L1, and WebArena-Lite show that JEF Hinter consistently outperforms strong baselines, including human- and document-based hints.
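The inference-time step described above, where a retriever selects hints relevant to the current state, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Hint` record, the `retrieve_hints` and `build_prompt` helpers, the example hints, and the bag-of-words cosine similarity, which stands in for whatever retriever the paper actually uses. The only ideas taken from the source are that each hint is stored under a semantic key, that hints may come from successful or failed traces, and that retrieved hints are injected into the agent's context with traceability back to their origin.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hint:
    """Hypothetical record for one distilled hint (field names are assumptions)."""
    key: str            # semantic key summarizing the context the hint applies to
    text: str           # the natural-language hint
    source_trace: str   # id of the offline trace it was distilled from
    from_failure: bool  # True if the trace was a failed run


def _bag_of_words(text: str) -> Counter:
    # Lowercased alphanumeric token counts; a stand-in for a real embedding.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_hints(query: str, store: list[Hint], k: int = 2) -> list[Hint]:
    """Return up to k hints whose semantic keys best match the query."""
    q = _bag_of_words(query)
    scored = [(_cosine(q, _bag_of_words(h.key)), h) for h in store]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    # Drop hints with no lexical overlap rather than padding the context.
    return [hint for score, hint in ranked[:k] if score > 0.0]


def build_prompt(goal: str, hints: list[Hint]) -> str:
    """Inject retrieved hints into the agent's context, keeping their origin visible."""
    lines = [f"Goal: {goal}", "Hints:"]
    for h in hints:
        outcome = "failed" if h.from_failure else "successful"
        lines.append(f"- {h.text} (from {outcome} trace {h.source_trace})")
    return "\n".join(lines)


# Illustrative hint store; these entries are invented, not taken from the paper.
HINT_STORE = [
    Hint("filter a list by a column value",
         "Open the column menu before typing in the filter box.",
         "trace-001", from_failure=True),
    Hint("submit an order form",
         "Fill all mandatory fields before clicking Submit.",
         "trace-002", from_failure=False),
    Hint("choose a date in a date picker",
         "Use the month arrows instead of typing the date.",
         "trace-003", from_failure=False),
]

if __name__ == "__main__":
    goal = "filter the incident list by priority column"
    print(build_prompt(goal, retrieve_hints(goal, HINT_STORE)))
```

Keeping the trace id and outcome next to each hint is one simple way to reflect the transparency and traceability the abstract mentions; a real system would likely replace the lexical similarity with learned embeddings.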

Paper Structure

This paper contains 51 sections, 2 equations, 11 figures, 6 tables, 2 algorithms.

Figures (11)

  • Figure 1: Average episodic reward versus test-time evaluation cost of JEF Hinter on MiniWoB++, WorkArena-L1, and WebArena-Lite, using GPT-5-mini as the Hinter model. Colors and markers denote different methods, while marker size reflects the base LLM model size. JEF Hinter achieves substantial gains over baselines, incurring only slightly higher cost than the original ReAct [yao2023react] agent while being far more efficient than Autoguide$^\dagger$ [fu2024autoguide].
  • Figure 2: Overview of the JEF Hinter. (1) Collect Traces: JEF Hinter operates over heterogeneous offline trajectories, including both successful (green) and failed (red) runs, allowing the system to capture not only effective behaviors but also common pitfalls. (2) Zoom and Reflect: A zooming module selects critical steps within each trace, and the hinter reflects on these segments to distill them into concise, reusable natural language hints. Each hint is paired with a semantic key summarizing its context and stored for retrieval. (3) Retrieve and Act: At inference time, the agent generates a query (goal- or context-conditioned) which is matched against the database of semantic keys. The most relevant hints are retrieved and injected into the agent's context, guiding its actions. This process unifies knowledge distillation, reflection, and retrieval, supporting both in-task reliability and out-of-task generalization.
  • Figure 3: Web browsing benchmarks considered in our work: MiniWoB++ [liu2018reinforcement], WorkArena-L1 [drouin2024workarena], and WebArena-Lite [zhou2024webarena, liu2025visualagentbench].
  • Figure 4: Average reward comparison across MiniWoB++, WorkArena-L1, and WebArena-Lite using two base models with GPT-5-mini as the Hinter model. JEF Hinter and JEF Hinter (w/o zoom) outperform all baselines on most tasks, highlighting the effectiveness of our approach. Shaded regions denote tasks where the base ReAct agent failed entirely, showing JEF Hinter’s ability to extract useful hints even from failure-only trajectories.
  • Figure 5: Out-of-task generalization performance on WorkArena-L1 and WebArena-Lite using two base models with GPT-5-mini as the base for the hinter model.
  • ...and 6 more figures