How Do LLMs Fail In Agentic Scenarios? A Qualitative Analysis of Success and Failure Scenarios of Various LLMs in Agentic Simulations

JV Roig

TL;DR

The paper investigates why LLMs fail as autonomous agents by analyzing 900 execution traces from Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 on the KAMI v0.1 benchmark. It finds that model scale alone does not predict agentic robustness and identifies four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, distractor-induced context pollution, and fragile execution under load. The work argues for agentic evaluation focused on interactive grounding, verification, and recovery, and reports that post-training alignment and structured feedback drive reliability more than architecture or size. It offers emergent principles and practical recommendations for enterprise deployment, emphasizing grounding, error-driven adaptation, and context curation as core design challenges.

Abstract

We investigate how large language models (LLMs) fail when operating as autonomous agents with tool-use capabilities. Using the Kamiwaza Agentic Merit Index (KAMI) v0.1 benchmark, we analyze 900 execution traces from three representative models - Granite 4 Small, Llama 4 Maverick, and DeepSeek V3.1 - across filesystem, text extraction, CSV analysis, and SQL scenarios. Rather than focusing on aggregate scores, we perform fine-grained, per-trial behavioral analysis to surface the strategies that enable successful multi-step tool execution and the recurrent failure modes that undermine reliability. Our findings show that model scale alone does not predict agentic robustness: Llama 4 Maverick (400B) performs only marginally better than Granite 4 Small (32B) in some uncertainty-driven tasks, while DeepSeek V3.1's superior reliability derives primarily from post-training reinforcement learning rather than architecture or size. Across models, we identify four recurring failure archetypes: premature action without grounding, over-helpfulness that substitutes missing entities, vulnerability to distractor-induced context pollution, and fragile execution under load. These patterns highlight the need for agentic evaluation methods that emphasize interactive grounding, recovery behavior, and environment-aware adaptation, suggesting that reliable enterprise deployment requires not just stronger models but deliberate training and design choices that reinforce verification, constraint discovery, and adherence to source-of-truth data.

Paper Structure

This paper contains 32 sections, 3 tables.