Beyond pass@1: A Reliability Science Framework for Long-Horizon LLM Agents

Aaditya Khanal, Yangyang Tao, Junxiu Zhou

Abstract

Existing benchmarks measure capability -- whether a model succeeds on a single attempt -- but production deployments require reliability -- consistent success across repeated attempts on tasks of varying duration. We show these properties diverge systematically as task duration grows, and that pass@1 on short tasks is structurally blind to this divergence. We introduce a reliability science framework for long-horizon LLM agents with four metrics: Reliability Decay Curve (RDC), Variance Amplification Factor (VAF), Graceful Degradation Score (GDS), and Meltdown Onset Point (MOP). We evaluate 10 models across 23,392 episodes on a 396-task benchmark spanning four duration buckets and three domains. Key findings: (1) reliability decay is domain-stratified -- SE GDS drops from 0.90 to 0.44 while document processing is nearly flat (0.74 to 0.71); (2) VAF bifurcates by capability tier -- high VAF is a capability signature, not an instability signal; (3) capability and reliability rankings diverge substantially, with multi-rank inversions at long horizons; (4) frontier models have the highest meltdown rates (up to 19%) because they attempt ambitious multi-step strategies that sometimes spiral; and (5) memory scaffolds universally hurt long-horizon performance across all 10 models. These results motivate reliability as a first-class evaluation dimension alongside capability.

Paper Structure

This paper contains 99 sections, 11 equations, 4 figures, 16 tables.

Figures (4)

  • Figure 1: Reliability Decay Curves (pass@1 vs. duration bucket) for all 10 models, ReAct scaffold. The frontier cluster (DeepSeek V3, Kimi K2.5, MiniMax M2.5) maintains $\geq 79\%$ at very-long; all other models show steeper declines. GLM-4.5 Air is a "leaky frontier" -- highest short pass@1 but steep long-horizon drop. Llama 3.3 70B shows non-monotone recovery at very-long.
  • Figure 2: VAF vs. long+very-long pass@1 for all 10 models. The bifurcation is visible as two clusters: frontier models (top-right, VAF $\geq 2.37$, pass $\geq 72.7\%$) and mid/small models (bottom-left). No model occupies the high-VAF/low-pass regime, confirming that high variance amplification requires high long-horizon performance.
  • Figure 3: ReAct vs. Memory scaffold long+very-long GDS for all 10 models. Every bar is at or below zero: the memory scaffold never improves long-horizon reliability. Kimi K2.5 and Mistral 24B show the largest penalties ($-0.14$ and $-0.13$ respectively).
  • Figure 4: Per-model pass@1 heatmap by domain and duration bucket (ReAct scaffold). SE tiles darken steeply from short to very-long for all models. DP tiles remain bright for frontier models. GLM-4.5 Air's WR weakness is visible as an anomalously dark WR column relative to its bright SE and DP columns.
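
As the Figure 1 caption describes it, a Reliability Decay Curve plots pass@1 against duration bucket. The following is a minimal sketch of that aggregation for a single model, not the authors' implementation. The function name `reliability_decay_curve`, the `(bucket, success)` record layout, the bucket label "medium", and the toy outcomes are all assumptions made for illustration; only "short", "long", and "very-long" appear in the text above.

```python
from collections import defaultdict

# Bucket order for the curve. "medium" is an assumed label; the text above
# names only short, long, and very-long among the four duration buckets.
BUCKETS = ["short", "medium", "long", "very-long"]


def reliability_decay_curve(episodes):
    """Return [(bucket, pass@1), ...] in duration order for one model.

    `episodes` is an iterable of (bucket, success) pairs, one per episode.
    Buckets with no episodes are skipped rather than reported as 0.
    """
    totals = defaultdict(int)
    wins = defaultdict(int)
    for bucket, success in episodes:
        totals[bucket] += 1
        wins[bucket] += int(success)
    return [(b, wins[b] / totals[b]) for b in BUCKETS if totals[b] > 0]


# Toy data (hypothetical, not results from the paper): 10 episodes per bucket.
toy_episodes = (
    [("short", True)] * 9 + [("short", False)] * 1
    + [("medium", True)] * 8 + [("medium", False)] * 2
    + [("long", True)] * 6 + [("long", False)] * 4
    + [("very-long", True)] * 5 + [("very-long", False)] * 5
)

curve = reliability_decay_curve(toy_episodes)
for bucket, rate in curve:
    print(f"{bucket:>10}: pass@1 = {rate:.2f}")
```

Keeping the buckets in an explicit ordered list, rather than sorting the keys, makes the duration ordering part of the definition, so a non-monotone curve such as the recovery noted for Llama 3.3 70B shows up directly in the output.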

Theorems & Definitions (6)

  • Definition 1: Pass@1
  • Definition 2: Pass$^k$
  • Definition 3: Reliability Decay Curve
  • Definition 4: Variance Amplification Factor
  • Definition 5: Graceful Degradation Score
  • Definition 6: Meltdown Onset Point
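
This outline lists Pass@1 and Pass$^k$ without giving their formulas. To illustrate the gap between single-attempt capability and repeated-attempt reliability, the sketch below assumes one common reading from the agent-evaluation literature: Pass$^k$ is the probability that all $k$ independent attempts at a task succeed, estimated from $n$ recorded attempts with $c$ successes as $\binom{c}{k} / \binom{n}{k}$. That reading, the function names `pass_at_1` and `pass_hat_k`, and the toy outcomes are assumptions, and the paper's own definitions may differ.

```python
from math import comb


def pass_at_1(outcomes):
    """Fraction of recorded attempts that succeeded (single-attempt success rate)."""
    return sum(outcomes) / len(outcomes)


def pass_hat_k(outcomes, k):
    """Estimate the probability that k attempts ALL succeed.

    Uses the combinatorial estimator C(c, k) / C(n, k), where n is the number
    of recorded attempts and c the number of successes. This is an assumed
    definition, not one taken from the paper.
    """
    n = len(outcomes)
    c = sum(outcomes)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of attempts")
    # math.comb returns 0 when c < k, so the estimate is 0 in that case.
    return comb(c, k) / comb(n, k)


# Toy task (hypothetical): 8 attempts, 6 successes.
toy_outcomes = [True] * 6 + [False] * 2

print(f"pass@1 = {pass_at_1(toy_outcomes):.3f}")
print(f"pass^4 = {pass_hat_k(toy_outcomes, 4):.3f}")
```

Under this reading, a task solved on 6 of 8 attempts has a pass@1 of 0.75, but the estimated chance of four consecutive successes is only 15/70, roughly 0.21. This is the kind of divergence between capability and reliability that the abstract describes.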