AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise
Tara Bogavelli, Roshnee Sharma, Hari Subramani
TL;DR
AgentArch is an end-to-end benchmark that evaluates how four architectural dimensions interact within enterprise task workflows: orchestration strategy, prompting style (function calling vs. ReAct), memory design, and thinking tool integration. It covers 18 configurations and six LLMs. The study finds substantial model-specific architectural preferences and no universally optimal design. Top performance reaches only 70.8% on the simpler task and 35.3% on the more complex one, which highlights reliability gaps in current agentic systems. Key contributions include the first systematic enterprise benchmark across multi-agent and single-agent setups, a detailed analysis of how architectural choices interact with model capabilities, and evidence-based guidance for model and configuration selection in enterprise environments. The findings underscore the need for empirically grounded design decisions to improve reliability, integration with business processes, and task-specific performance in real-world deployments. The paper also outlines concrete directions for future research, including more diverse tasks, multimodal data, and efficiency considerations. Reliability reporting is based on $k=8$ attempts per configuration, and even the best-performing configurations leave substantial room for improvement in end-to-end enterprise automation.
Abstract
While individual components of agentic architectures have been studied in isolation, there remains limited empirical understanding of how different design dimensions interact within complex multi-agent systems. This study aims to address this gap by providing a comprehensive enterprise-specific benchmark that evaluates 18 distinct agentic configurations across state-of-the-art large language models. We examine four critical agentic system dimensions: orchestration strategy, agent prompt implementation (ReAct versus function calling), memory architecture, and thinking tool integration. Our benchmark reveals significant model-specific architectural preferences that challenge the prevalent one-size-fits-all paradigm in agentic AI systems. It also exposes significant weaknesses in overall agentic performance on enterprise tasks: the highest-scoring models achieve at most 35.3% success on the more complex task and 70.8% on the simpler task. We hope these findings inform the design of future agentic systems by enabling more empirically backed decisions regarding architectural components and model selection.
