AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise

Tara Bogavelli, Roshnee Sharma, Hari Subramani

TL;DR

AgentArch provides an end-to-end benchmark of how four architectural dimensions interact within enterprise task workflows: orchestration strategy, prompting style (function calling vs. ReAct), memory design, and thinking tool integration. It covers 18 configurations and six LLMs. The study demonstrates substantial model-specific preferences and a lack of universally optimal designs, with top performance limited to 70.8% on the simpler task and 35.3% on the more complex one, highlighting reliability gaps in current agentic systems. Key contributions include the first systematic enterprise benchmark across multi-agent and single-agent setups, a detailed analysis of how architectural choices interact with model capabilities, and evidence-based guidance for model and configuration selection in enterprise environments. The findings underscore the need for empirically grounded design decisions to improve reliability, integration with business processes, and task-specific performance in real-world deployments. They also outline concrete directions for future research, including more diverse tasks, multimodal data, and efficiency considerations. Reliability is reported over $k=8$ attempts per configuration, and even the best-performing configurations leave substantial room for improvement in end-to-end enterprise automation.

Abstract

While individual components of agentic architectures have been studied in isolation, there remains limited empirical understanding of how different design dimensions interact within complex multi-agent systems. This study aims to address these gaps by providing a comprehensive enterprise-specific benchmark evaluating 18 distinct agentic configurations across state-of-the-art large language models. We examine four critical agentic system dimensions: orchestration strategy, agent prompt implementation (ReAct versus function calling), memory architecture, and thinking tool integration. Our benchmark reveals significant model-specific architectural preferences that challenge the prevalent one-size-fits-all paradigm in agentic AI systems. It also reveals significant weaknesses in overall agentic performance on enterprise tasks, with the highest-scoring models achieving a maximum of only 35.3% success on the more complex task and 70.8% on the simpler task. We hope these findings inform the design of future agentic systems by enabling more empirically backed decisions regarding architectural components and model selection.

Paper Structure

This paper contains 17 sections, 1 equation, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Orchestration, Isolated Agents. Agents ask the orchestrator for help, and the orchestrator selects an agent to assist.
  • Figure 2: Orchestration, Open Agent Network. Agents ask other agents for help directly.
  • Figure 3: Pass@1 Acceptable Rate. Heatmap cells report the Pass@1 Acceptable score (higher/darker is better) for each model–configuration pair on the two enterprise tasks: Time Off (TO) and Customer Routing (CR). Pass@1 is computed over $k=8$ attempts per configuration. Overall, function-calling configurations tend to outperform ReAct, thinking tools often help on the simpler TO task, and no single architecture dominates across models or tasks.
  • Figure 4: Pass@1 Hallucination Rate ($\downarrow$). Heatmap cells show the percentage of trials (lower is better, darker indicates higher error) in which an agent hallucinated non-existent entities or schema, e.g., selecting a tool or agent not present in the registry or inventing parameters not defined by a tool schema. Rates are computed at Pass@1 over $k=8$ attempts per model–configuration on both tasks (TO, CR). Hallucinations tend to concentrate in ReAct settings, especially in multi-agent orchestration.
  • Figure 5: Pass@1 Correct Final Decision Rate. Heatmap cells report the probability (higher/darker is better) that a model–configuration returns the correct ground-truth outcome (approve/deny, route/escalate, etc.). Computed at Pass@1 over $k=8$ attempts on both tasks (TO, CR). Together with Fig. 3 (Acceptable), this highlights a trade-off between decision accuracy and end-to-end execution reliability.
  • ...and 4 more figures
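
The captions for Figures 3 and 4 describe two per-configuration quantities: a Pass@1 rate over $k=8$ attempts, and a hallucination rate that counts attempts in which an agent selects a tool not present in the registry or invents parameters not defined by a tool schema. The sketch below shows one way such rates could be computed. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the `ToolCall` structure, the registry layout, the tool names, and the attempt data are all hypothetical, and Pass@1 is taken here to be the plain fraction of successful attempts.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One tool invocation emitted by an agent (hypothetical structure)."""
    name: str
    arguments: dict = field(default_factory=dict)


def is_hallucinated(call: ToolCall, registry: dict[str, set[str]]) -> bool:
    """True if the call names an unregistered tool or uses undefined parameters."""
    if call.name not in registry:
        return True
    return not set(call.arguments) <= registry[call.name]


def rate(flags: list[bool]) -> float:
    """Fraction of attempts for which the flag is True."""
    if not flags:
        raise ValueError("need at least one attempt")
    return sum(flags) / len(flags)


# Hypothetical registry: tool name -> parameter names defined by its schema.
REGISTRY = {
    "get_leave_balance": {"employee_id"},
    "submit_time_off": {"employee_id", "start_date", "end_date"},
}

# A well-formed trajectory (made-up data for illustration only).
clean = [
    ToolCall("get_leave_balance", {"employee_id": "E1"}),
    ToolCall("submit_time_off", {"employee_id": "E1",
                                 "start_date": "2025-01-06",
                                 "end_date": "2025-01-07"}),
]

# k = 8 attempts for one model-configuration pair: six clean, one calling an
# unregistered tool, one passing a parameter the schema does not define.
attempts = [clean] * 6 + [
    [ToolCall("approve_leave", {"employee_id": "E1"})],
    [ToolCall("get_leave_balance", {"employee_id": "E1", "year": 2025})],
]

# An attempt counts as hallucinated if any of its calls is hallucinated.
hallucinated = [any(is_hallucinated(c, REGISTRY) for c in calls)
                for calls in attempts]
hallucination_rate = rate(hallucinated)

# Per-attempt "acceptable" outcomes would come from the benchmark's own
# grader; these eight values are placeholders.
acceptable = [True, True, False, True, False, True, True, False]
pass_at_1 = rate(acceptable)

print(f"hallucination rate: {hallucination_rate:.3f}")
print(f"pass@1 acceptable:  {pass_at_1:.3f}")
```

Flagging an attempt as hallucinated when any single call fails the registry check matches the caption's per-trial framing; a per-call rate would be a different, finer-grained metric.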