Structured Distillation for Personalized Agent Memory: 11x Token Reduction with Retrieval Preservation

Sydney Lewis

Abstract

Long conversations with an AI agent create a simple problem for one user: the history is useful, but carrying it verbatim is expensive. We study personalized agent memory: one user's conversation history with an agent, distilled into a compact retrieval layer for later search. Each exchange is compressed into a compound object with four fields (exchange_core, specific_context, thematic room_assignments, and regex-extracted files_touched). The searchable distilled text averages 38 tokens per exchange. Applied to 4,182 conversations (14,340 exchanges) from 6 software engineering projects, the method reduces average exchange length from 371 to 38 tokens, yielding 11x compression. We evaluate whether personalized recall survives that compression using 201 recall-oriented queries, 107 configurations spanning 5 pure and 5 cross-layer search modes, and 5 LLM graders (214,519 consensus-graded query-result pairs). The best pure distilled configuration reaches 96% of the best verbatim MRR (0.717 vs 0.745). Results are mechanism-dependent. All 20 vector search configurations show no statistically significant difference from verbatim after Bonferroni correction, while all 20 BM25 configurations degrade significantly (effect sizes |d|=0.031-0.756). The best cross-layer setup slightly exceeds the best pure verbatim baseline (MRR 0.759). Structured distillation compresses single-user agent memory without uniformly sacrificing retrieval quality. At 1/11 the context cost, thousands of exchanges fit within a single prompt while the verbatim source remains available for drill-down. We release the implementation and analysis pipeline as open-source software.
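
To make the four-field object and the MRR comparison concrete, here is a minimal Python sketch. It is not the released implementation. The field names come from the abstract, but the class name `DistilledExchange`, the regex `FILE_PATTERN`, the choice of which fields form the searchable text, and the helper names are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical pattern: path-like tokens that end in a short file extension.
# The paper's actual regex is not given in the abstract.
FILE_PATTERN = re.compile(r"[\w./-]+\.[A-Za-z]{1,5}\b")


@dataclass
class DistilledExchange:
    """One exchange compressed into the four fields named in the abstract."""
    exchange_core: str
    specific_context: str
    room_assignments: List[str] = field(default_factory=list)
    files_touched: List[str] = field(default_factory=list)

    def searchable_text(self) -> str:
        # Assumption: the searchable text is the two free-text fields joined.
        return f"{self.exchange_core} {self.specific_context}"


def extract_files(verbatim: str) -> List[str]:
    """Return file-like tokens in order of first appearance, without repeats."""
    return list(dict.fromkeys(FILE_PATTERN.findall(verbatim)))


def mean_reciprocal_rank(first_relevant_ranks: List[Optional[int]]) -> float:
    """MRR over queries; each entry is the 1-based rank of the first
    relevant result, or None when no relevant result was returned."""
    if not first_relevant_ranks:
        return 0.0
    total = sum(1.0 / r if r else 0.0 for r in first_relevant_ranks)
    return total / len(first_relevant_ranks)


verbatim = "Edited src/auth/login.py and tests/test_login.py to fix the token bug."
record = DistilledExchange(
    exchange_core="Fixed token refresh bug in login flow",
    specific_context="Expired tokens were reused after logout",
    room_assignments=["authentication"],
    files_touched=extract_files(verbatim),
)

# Ratio of the two MRR values reported in the abstract (0.717 vs 0.745).
retained = 0.717 / 0.745
```

The reported "96% of the best verbatim MRR" corresponds to `retained` rounded to two decimals. The verbatim text would stay in a separate store keyed to each record, so a hit on the distilled layer can be expanded on demand.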

Paper Structure (58 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: MRR heatmap by mode and retrieval mechanism (best fusion per cell). Darker cells indicate higher MRR.
  • Figure 2: Effect sizes (Cohen's $d$) for per-mechanism comparisons (verbatim vs each distilled mode, averaging across fusion strategies within each mode) with 95% bootstrap confidence intervals. Grouped by mechanism.
  • Figure 3: Distillation impact by query type. Heatmap showing mean grade delta relative to pooled verbatim baseline; negative values (red) indicate degradation, positive values (blue) indicate improvement.
  • Figure 4: Grade composition by search mode. Bars show the proportion of pooled results at grades 0--3 for each mode. Grade 3 is the largest bucket in every mode, but the overall profiles remain similar, with most mass concentrated in grades 1 and 3.
  • Figure 5: Pairwise proportion agreement heatmap among 5 LLM graders. Values show the fraction of items where two graders assigned identical grades. Cohen's $\kappa$ values (Table \ref{tab:agreement-summary} and Appendix \ref{app:full-results-tables}) are lower due to chance correction.
  • ...and 1 more figure