
Consistency Amplifies: How Behavioral Variance Shapes Agent Accuracy

Aman Mehta

Abstract

As LLM-based agents are deployed in production systems, understanding their behavioral consistency (whether they produce similar action sequences when given identical tasks) becomes critical for reliability. We study consistency in the context of SWE-bench, a challenging software engineering benchmark requiring complex, multi-step reasoning. Comparing Claude~4.5~Sonnet, GPT-5, and Llama-3.1-70B across 50 runs each (10 tasks $\times$ 5 runs), we find that across models, higher consistency aligns with higher accuracy: Claude achieves the lowest variance (CV: 15.2\%) and highest accuracy (58\%), GPT-5 is intermediate (CV: 32.2\%, accuracy: 32\%), and Llama shows the highest variance (CV: 47.0\%) with lowest accuracy (4\%). However, within a model, consistency can amplify both correct and incorrect interpretations. Our analysis reveals a critical nuance: \textbf{consistency amplifies outcomes rather than guaranteeing correctness}. 71\% of Claude's failures stem from "consistent wrong interpretation": making the same incorrect assumption across all runs. Interestingly, GPT-5 reaches early strategic agreement similar to Claude's (diverging at step 3.4 vs.\ 3.2) but exhibits 2.1$\times$ higher variance, suggesting that divergence timing alone does not determine consistency. These findings suggest that for production deployment, interpretation accuracy matters more than execution consistency, with implications for agent evaluation and training.
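The consistency metric in the abstract is the coefficient of variation (CV), the standard deviation of a model's per-task step counts divided by their mean. As a minimal sketch (the paper does not specify whether it uses the population or sample standard deviation, or how per-task CVs are aggregated; the step counts below are hypothetical):

```python
import statistics

def coefficient_of_variation(step_counts):
    """CV of step counts across repeated runs of one task,
    as a percentage: population std / mean * 100.
    (Assumption: population std; the paper does not say.)"""
    mean = statistics.mean(step_counts)
    return statistics.pstdev(step_counts) / mean * 100

# Hypothetical step counts for one task across 5 runs.
runs = [12, 13, 12, 14, 12]
print(f"CV: {coefficient_of_variation(runs):.1f}%")  # → CV: 6.3%
```

A low CV, as in this example, would indicate Claude-like consistency; a CV near 47%, as reported for Llama, would mean run lengths routinely deviate from the mean by nearly half its value.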

Paper Structure

This paper contains 63 sections, 1 equation, 8 figures, and 9 tables.

Figures (8)

  • Figure 1: Behavioral consistency comparison across three models. Claude achieves the lowest coefficient of variation, followed by GPT-5, then Llama. Individual task CVs shown as points.
  • Figure 2: Step count heatmap across all 150 runs. Claude (left) shows uniform coloring; GPT-5 (middle) shows low counts with moderate variance; Llama (right) shows high variance.
  • Figure 3: Per-task consistency (CV) vs accuracy across three models. No significant within-model correlation exists. Consistency does not predict accuracy at the task level.
  • Figure 4: Per-task accuracy comparison across three models. Claude achieves 58% overall, GPT-5 32%, and Llama 4%. Note that Llama outperforms both on astropy-13236.
  • Figure 5: Phase decomposition across three models. Claude invests heavily in UNDERSTAND, GPT-5 emphasizes VERIFY, and Llama spends more on EXPLORE.
  • ...and 3 more figures