Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Haozhen Zhang; Haodong Yue; Tao Feng; Quanyu Long; Jianzhu Bao; Bowen Jin; Weizhi Zhang; Xiao Li; Jiaxuan You; Chengwei Qin; Wenya Wang

TL;DR

BudgetMem performs on-demand, query-aware memory extraction for LLM agents through a modular pipeline in which each module exposes Low/Mid/High budget tiers. A lightweight PPO-based router assigns a tier to each module invocation, optimizing a scalar reward r = r_task + λ · α · r_cost that balances task performance against extraction cost; costs are normalized over a sliding window, and α aligns the scale of the cost term to stabilize learning. The approach is evaluated on LoCoMo, LongMemEval, and HotpotQA, comparing three tiering strategies (implementation, reasoning, and capacity) and showing improved performance-cost frontiers under budget constraints. The work offers a practical framework for deploying memory-augmented agents under explicit compute and latency budgets while preserving answer quality.
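As a rough illustration of the reward described above, the sketch below combines a task reward with a window-normalized cost term. Several details are assumptions and not taken from the paper: the class name `CostNormalizer`, the min-max form of the sliding-window normalization, the window size, and the sign convention that r_cost lies in [-1, 0] (so that adding it penalizes expensive extractions). Here α is simply passed in as a constant.

```python
from collections import deque


class CostNormalizer:
    """Min-max normalizes raw costs over a sliding window (assumed form)."""

    def __init__(self, window: int = 100):
        # Only the most recent `window` raw costs are kept.
        self.history = deque(maxlen=window)

    def cost_reward(self, raw_cost: float) -> float:
        """Return r_cost in [-1, 0]: 0 for the cheapest cost in the window,
        -1 for the most expensive (assumed sign convention)."""
        self.history.append(raw_cost)
        lo, hi = min(self.history), max(self.history)
        if hi == lo:
            # No spread in the window yet, so no cost signal.
            return 0.0
        return -(raw_cost - lo) / (hi - lo)


def total_reward(r_task: float, r_cost: float, lam: float, alpha: float) -> float:
    """Scalar reward r = r_task + lam * alpha * r_cost."""
    return r_task + lam * alpha * r_cost


normalizer = CostNormalizer(window=3)
rewards = [normalizer.cost_reward(c) for c in (10.0, 20.0, 15.0, 30.0)]
```

With a window of 3, the fourth cost is normalized against the last three observations only, so the earliest value no longer influences the scale. A larger λ makes the cost term weigh more heavily in the combined reward.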

Abstract

Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router, implemented as a compact neural policy trained with reinforcement learning, performs budget-tier routing across modules to balance task performance and memory construction cost. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

Paper Structure (57 sections, 13 equations, 5 figures, 10 tables)

This paper contains 57 sections, 13 equations, 5 figures, 10 tables.

Related Work
Memory-Augmented LLM Agents
Inference-Time Performance-Cost Trade-offs in LLM Systems
Problem Setup and Method Overview
Inputs.
Outputs.
Budgeted runtime extraction.
BudgetMem
Modular Runtime Memory Pipeline
Pipeline structure.
Module interfaces and intermediate states.
Budget-tier interface (conceptual).
Budget Tiers and Tiering Strategies
Tiering strategies.
Realizing Low/Mid/High.
...and 42 more sections

Figures (5)

Figure 1: BudgetMem overview. Given a user query $q$, we retrieve raw chunks $\mathcal{C}_q$ from a chunked history (without offline memory preprocessing) and process them with a modular pipeline (filter $\rightarrow$ entity/temporal/topic $\rightarrow$ summary). Each module exposes Low/Mid/High budget tiers instantiated by one of three strategies (implementation, reasoning, capacity). A shared lightweight router selects tiers module-wise based on the query and intermediate states, and is trained with reinforcement learning using task and cost rewards to yield controllable performance--cost trade-offs. ($\mathcal{C}_q$: retrieved raw chunks, $\tilde{\mathcal{C}}_q$: filtered chunks, $e,t,p$: extracted contexts, $m$: extracted memory)
Figure 2: Performance--cost trade-offs across tiering strategies on LoCoMo. By varying the cost weight $\lambda$, BudgetMem traces smooth, controllable frontiers that shift toward higher performance as budget increases, and envelop baselines in both low- and high-cost regimes.
Figure 3: Ablation of reward-scale alignment under capacity tiering strategy on LoCoMo.
Figure 4: Budget-tier selection ratios. Module-wise Low/Mid/High routing ratios on LongMemEval under varying cost weights $\lambda$ using the capacity tiering strategy.
Figure 5: Retrieval-size sensitivity on LoCoMo. Cost and Judge versus the number of retrieved raw chunks, evaluated under all three tiering strategies.
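
The pipeline in Figure 1 (filter, then entity/temporal/topic, then summary, with a tier chosen per module) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the per-tier cost values, the placeholder `run_module`, and the heuristic `keyword_router` are all hypothetical. In the paper the router is a learned policy; a hand-written rule is swapped in here only to keep the example runnable.

```python
from typing import Callable, Dict, List, Tuple

TIERS = ("low", "mid", "high")
# Hypothetical per-tier cost units; the source gives no such numbers.
TIER_COST = {"low": 1.0, "mid": 3.0, "high": 9.0}

# A router maps (query, module name, intermediate state) to a tier.
Router = Callable[[str, str, str], str]


def run_module(name: str, tier: str, text: str) -> str:
    """Placeholder for a real module: tags its input with the tier used."""
    return f"{name}[{tier}]({text})"


def run_pipeline(
    query: str, chunks: List[str], router: Router
) -> Tuple[str, Dict[str, str], float]:
    """Run filter -> entity/temporal/topic -> summary with per-module tiers."""
    cost = 0.0
    trace: Dict[str, str] = {}

    def call(name: str, state: str) -> str:
        nonlocal cost
        tier = router(query, name, state)
        assert tier in TIERS, f"unknown tier: {tier}"
        trace[name] = tier
        cost += TIER_COST[tier]
        return run_module(name, tier, state)

    filtered = call("filter", " | ".join(chunks))
    contexts = [call(n, filtered) for n in ("entity", "temporal", "topic")]
    memory = call("summary", " ; ".join(contexts))
    return memory, trace, cost


def keyword_router(query: str, module: str, state: str) -> str:
    """Stand-in heuristic: spend more on temporal extraction for 'when' queries."""
    if module == "temporal" and "when" in query.lower():
        return "high"
    return "low"


memory, trace, cost = run_pipeline(
    "When did Alice move?", ["chunk-1", "chunk-2"], keyword_router
)
```

The returned `trace` records which tier each module received, and `cost` accumulates the assumed tier costs; these are the kinds of signals a cost reward could be computed from.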
