
Beyond Completion: Probing Cumulative State Tracking to Predict LLM Agent Performance

Dengzhe Hou, Lingyu Jiang, Deng Li, Zirui Li, Fangzhou Lin, Kazunori D Yamada

Abstract

Task-completion rate is the standard proxy for LLM agent capability, but models with identical completion scores can differ substantially in their ability to track intermediate state. We introduce Working Memory Fidelity-Active Manipulation (WMF-AM), a calibrated no-scratchpad probe of cumulative arithmetic state tracking, and evaluate it on 20 open-weight models (0.5B-35B, 13 families) against a released deterministic 10-task agent battery. In a pre-specified, Bonferroni-corrected analysis, WMF-AM predicts agent performance with Kendall's tau = 0.612 (p < 0.001, 95% CI [0.360, 0.814]); exploratory partial-tau analyses suggest this signal persists after controlling for completion score and model scale. Three construct-isolation ablations (K = 1 control, non-arithmetic ceiling, yoked cancellation) support the interpretation that cumulative state tracking under load, rather than single-step arithmetic or entity tracking alone, is the primary difficulty source. K-calibration keeps the probe in a discriminative range where prior fixed-depth benchmarks become non-discriminative; generalization beyond this open-weight sample remains open.
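The abstract's headline statistics are Kendall's tau and an exploratory partial tau controlling for completion score. As a minimal illustration of how both are computed, the sketch below implements tau-a and the standard first-order partial-tau formula in pure Python; the score vectors are hypothetical toy data, not the paper's measurements.

```python
from itertools import combinations
from math import sqrt

def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        d = (x[i] - x[j]) * (y[i] - y[j])
        s += 1 if d > 0 else (-1 if d < 0 else 0)
    return s / (n * (n - 1) / 2)

def partial_tau(x, y, z):
    """First-order partial Kendall's tau of x and y, controlling for z."""
    txy, txz, tyz = kendall_tau(x, y), kendall_tau(x, z), kendall_tau(y, z)
    return (txy - txz * tyz) / sqrt((1 - txz**2) * (1 - tyz**2))

# Hypothetical per-model scores for illustration (NOT the paper's data).
wmf   = [0.1, 0.3, 0.4, 0.6, 0.8]     # WMF-AM probe accuracy
agent = [0.2, 0.5, 0.45, 0.55, 0.9]   # agent battery score
comp  = [0.5, 0.5, 0.6, 0.6, 0.7]     # completion rate (clustered)

print(kendall_tau(wmf, agent))        # 0.8
print(partial_tau(wmf, agent, comp))  # ~0.667
```

The partial tau mirrors the paper's exploratory check: whether the WMF-AM/agent-score association survives after removing the ordering shared with completion rate.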

Paper Structure

This paper contains 75 sections, 4 figures, and 16 tables.

Figures (4)

  • Figure 1: (a) Completion scores cluster while WMF-AM spans a wide range ($N{=}20$, 13 families). (b) WMF-AM predicts downstream agent performance ($\tau{=}0.612$, $p{<}0.001$; pre-specified confirmatory). Exploratory: partial $\tau{=}0.411$ ($p{=}0.011$) after controlling for completion (same sample; requires held-out replication). Blue circles = the original 7 models; orange squares = the 8 expansion models; green triangles = small models (0.5B--2B).
  • Figure 2: WMF-AM accuracy by depth $K$ ($N{=}15$, chat template). deepseek-r1:14b (red) maintains near-perfect accuracy; all others degrade sharply at $K \geq 5$. Top 5 models highlighted.
  • Figure 3: CEF taxonomy. Four proposed dimensions (WMF, MCC, EMC, CLA) with sub-dimensions, cognitive anchors, and probe mappings. Only WMF-AM is empirically validated in this paper; other dimensions are included as design context.
  • Figure 4: Validity evidence heatmap. Convergent and divergent correlations across CEF probes and external benchmarks ($N{=}15$). WMF-AM shows high divergent validity (low correlation with completion) and moderate convergent validity with agent battery scores.
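Figure 2's depth sweep implies a probe that chains $K$ arithmetic updates onto a running value and scores only the final state, with no scratchpad. The paper does not reproduce its generator here, so the following is a hypothetical sketch of such a depth-$K$ probe; the operation set, value ranges, and prompt wording are assumptions, not the released WMF-AM implementation.

```python
import random

def make_probe(k, seed=0):
    """Generate a depth-k cumulative-arithmetic probe: a start value,
    k sequential update operations, and the ground-truth final state."""
    rng = random.Random(seed)  # seeded for deterministic probes
    state = rng.randint(1, 20)
    steps, answer = [], state
    for _ in range(k):
        op = rng.choice(["add", "subtract"])
        val = rng.randint(1, 9)
        answer = answer + val if op == "add" else answer - val
        steps.append(f"{op} {val}")
    prompt = (f"Start with {state}. Then {', then '.join(steps)}. "
              "Reply with the final number only.")
    return prompt, answer

prompt, answer = make_probe(k=5, seed=42)
print(prompt)
print(answer)
```

Calibrating $K$ per model, as the paper describes, would then amount to sweeping `k` until accuracy leaves the ceiling and enters a discriminative range.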