APEX-EM: Non-Parametric Online Learning for Autonomous Agents via Structured Procedural-Episodic Experience Replay

Pratyay Banerjee, Masud Moshtaghi, Ankit Chadha

Abstract

LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before. We present \textbf{APEX-EM}, a non-parametric online learning framework that accumulates, retrieves, and reuses structured procedural plans without modifying model weights. APEX-EM introduces: (1) a \emph{structured experience representation} encoding the full procedural-episodic trace of each execution -- planning steps, artifacts, iteration history with error analysis, and quality scores; (2) a \emph{Plan-Retrieve-Generate-Iterate-Ingest} (PRGII) workflow with Task Verifiers providing multi-dimensional reward signals; and (3) a \emph{dual-outcome Experience Memory} with hybrid retrieval combining semantic search, structural signature matching, and plan DAG traversal -- enabling cross-domain transfer between tasks sharing no lexical overlap but analogous operational structure. Successful experiences serve as positive in-context examples; failures serve as negative examples with structured error annotations. We evaluate on BigCodeBench~\cite{zhuo2025bigcodebench}, KGQAGen-10k~\cite{zhang2025kgqagen}, and Humanity's Last Exam~\cite{phan2025hle} using Claude Sonnet 4.5 and Opus 4.5. On KGQAGen-10k, APEX-EM achieves 89.6\% accuracy versus 41.3\% without memory (+48.3pp), surpassing the oracle-retrieval upper bound (84.9\%). On BigCodeBench, it reaches an 83.3\% success rate (SR) from a 53.9\% baseline (+29.4pp), exceeding MemRL's~\cite{memrl2025} +11.0pp gain under comparable frozen-backbone conditions (backbone differences are controlled for in our analysis). On HLE, entity graph retrieval reaches 48.0\% from 25.2\% (+22.8pp). Ablations show component value is task-dependent: rich judge feedback is negligible for code generation but critical for structured queries (+10.3pp), while binary-signal iteration partially compensates for weaker feedback.
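As a hypothetical illustration of the dual-outcome hybrid retrieval described above, the sketch below blends a semantic score (embedding cosine similarity) with a structural score (Jaccard overlap of signature elements) and splits the top-$k$ results into positive and negative tracks by stored quality. All names, weights, and data structures here are illustrative assumptions, not the paper's implementation, and the sketch omits the plan-DAG traversal component.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two embedding vectors
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def jaccard(a, b):
    # Overlap of two structural signatures (sets of plan-step labels)
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve(query, memory, k=2, w_sem=0.5, w_struct=0.5):
    """Rank stored experiences by a weighted blend of semantic and
    structural similarity, then split the top-k into P+ / P- tracks."""
    scored = sorted(
        memory,
        key=lambda e: w_sem * cosine(query["emb"], e["emb"])
                      + w_struct * jaccard(query["sig"], e["sig"]),
        reverse=True,
    )[:k]
    pos = [e for e in scored if e["quality"] >= 0.5]  # P+: in-context demos
    neg = [e for e in scored if e["quality"] < 0.5]   # P-: error warnings
    return pos, neg

# Toy memory: embeddings and signatures are made up for illustration
memory = [
    {"id": "exp1", "emb": [1.0, 0.0], "sig": {"load_csv", "groupby"}, "quality": 0.9},
    {"id": "exp2", "emb": [0.9, 0.1], "sig": {"load_csv", "rolling"}, "quality": 0.2},
    {"id": "exp3", "emb": [0.0, 1.0], "sig": {"http_get"},            "quality": 0.8},
]
query = {"emb": [1.0, 0.05], "sig": {"load_csv", "groupby"}}
pos, neg = retrieve(query, memory)
```

With these toy values the query's nearest neighbors are `exp1` (a success, returned as a positive example) and `exp2` (a failure, returned as a negative example), while the structurally unrelated `exp3` is excluded.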

Paper Structure

This paper contains 74 sections, 9 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: The PRGII workflow. A task $t$ flows left-to-right through five phases: Plan (task understanding, entity/schema discovery, structural signature extraction), Retrieve (semantic search, structural signature matching, and PKG traversal with dual-track separation into positive $P^+$ and negative $P^-$ examples), Generate (artifact generation conditioned on $\mathcal{P}(t)$, $P^+$, $P^-$, with reflection layers steering generation), Iterate (execute, validate via $\mathcal{V}_d$, refine on failure), and Ingest (Teacher evaluation, quality gate, commit to Experience Memory $\mathcal{M}$ via Add or Merge). The PKG supports entity resolution during planning and experience retrieval during retrieval. Both entity and experience nodes are updated during ingestion, closing the online learning loop.
  • Figure 2: Cross-domain structural transfer.
  • Figure 3: Epoch-by-epoch success rate across all ablation configurations. Left: BigCodeBench (Sonnet 4.5, 798 tasks). A5 (Opus judge) separates early and leads throughout at 83.3% (E10); all configs complete. Center: KGQAGen-10k (Sonnet 4.5, 2K tasks). Massive first-epoch jump from 41.3% baseline to $\sim$72--75%; A2 (rich feedback) reaches 85.9% vs A1 (binary) at 75.6% by E9. Right: HLE (Opus 4.5, 500 tasks). A3 (entity graph) leads at 48.0% (E10); A4 (no judge) plateaus at $\sim$19% despite high CSR (47.9%). All HLE ablations complete.
  • Figure 4: Iteration distribution shift from E1 to E10. Left: BCB EG2---first-attempt solutions increase from 63.5% to 76.8% (+13.3pp). Right: KGQA A3---first-attempt solutions increase from 34.2% to 63.1% (+28.9pp), while tasks hitting the maximum (10 iterations) decrease from 16.3% to 8.6%.
  • Figure 5: BCB Entity Graph. Tasks A ("filter and group CSV data") and B ("compute rolling average of stock prices") share entities (pandas, DataFrame) and procedural steps, enabling structural retrieval despite different descriptions.
  • ...and 5 more figures
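The five PRGII phases described in the Figure 1 caption can be sketched as a single control loop. This is a minimal sketch under stated assumptions: `generate`, the verifier, and the teacher are stand-in stubs, and the signature extraction is a crude placeholder; none of these names are the paper's API.

```python
def generate(plan, pos, neg, history):
    # Stand-in for LLM generation conditioned on the plan, P+, P-,
    # and the error annotations accumulated in `history`
    return f"solution attempt {len(history)} for: {plan['task']}"

def prgii(task, memory, verifier, teacher, max_iters=10, quality_gate=0.5):
    """One pass of the Plan-Retrieve-Generate-Iterate-Ingest loop."""
    plan = {"task": task, "sig": set(task.split())}  # Plan: crude signature
    pos = [e for e in memory if e["sig"] & plan["sig"] and e["ok"]]       # P+
    neg = [e for e in memory if e["sig"] & plan["sig"] and not e["ok"]]   # P-
    artifact, history, ok = None, [], False
    for i in range(max_iters):                        # Generate + Iterate
        artifact = generate(plan, pos, neg, history)
        ok, error = verifier(artifact)                # execute + validate
        history.append({"attempt": i, "ok": ok, "error": error})
        if ok:
            break                                     # refine only on failure
    score = teacher(artifact, history)                # Ingest: Teacher eval
    if score >= quality_gate or not ok:
        # Dual-outcome memory: commit gated successes AND annotated failures
        memory.append({"sig": plan["sig"], "ok": ok, "trace": history})
    return artifact, memory

# Toy run: the stub verifier accepts the second attempt ("attempt 1")
verifier = lambda a: (("attempt 1" in a), None if "attempt 1" in a else "failed check")
teacher = lambda a, h: 1.0 if h[-1]["ok"] else 0.0
artifact, memory = prgii("filter csv rows", [], verifier, teacher)
```

The quality gate in the Ingest phase mirrors Figure 1's commit step: successes must clear the Teacher's threshold, while failures are always retained so their error traces can steer later generations as negative examples.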