An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan

Abstract

Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests. Our findings offer the first empirical testing baseline in FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To address it, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
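
To make two of the traditional patterns named in the abstract concrete, here is a minimal sketch of a membership test and a negative test, each laid out in Arrange, Act, Assert blocks. It assumes that "membership testing" means asserting that an expected item appears in the output rather than comparing the full output, and that "negative testing" means checking that invalid input is rejected; the paper's own definitions may be narrower or broader. The function `fake_agent_reply` and the company name are hypothetical stand-ins, not taken from the study, and no real foundation model is called.

```python
import unittest


def fake_agent_reply(question: str) -> str:
    """Hypothetical stand-in for an FM-backed agent call (no real model is used)."""
    if not question.strip():
        raise ValueError("question must not be empty")
    # A real agent could phrase this differently on every run.
    return f"The company you asked about is Acme Corp. (query: {question})"


class AgentReplyTests(unittest.TestCase):
    def test_membership(self):
        # Arrange: prepare the input for the agent.
        question = "Which company makes the widget?"
        # Act: call the (stubbed) agent.
        reply = fake_agent_reply(question)
        # Assert: check that the key fact is present, not the exact wording.
        self.assertIn("Acme Corp", reply)

    def test_negative(self):
        # Arrange: an input the agent should refuse.
        question = "   "
        # Act and Assert: the call must fail with the expected error.
        with self.assertRaises(ValueError):
            fake_agent_reply(question)
```

Saved as a module, the tests can be run with `python -m unittest`. The membership assertion tolerates variation in the surrounding text, which is one plausible reason such checks suit non-deterministic output.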

Paper Structure

This paper contains 76 sections, 3 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Overview of the Research Method
  • Figure 2: Distribution of agent application categories in our dataset. Each slice shows the proportion of agents assigned to a high-level category. Categories with more than one instance are reported individually: Generic Assistant (application, workflow manager or builder), Software Maintenance (monitoring, cloudops, review, testing, inferencing), Software Development (coding, documentation), Medicine, Crypto, and Project Management. All single-instance categories are aggregated into Others comprising assistants capable of data analysis, memory, blogger, ecommerce, hrmis, physics, and research.
  • Figure 3: A sample test function marked with Arrange--Act--Assert blocks.
  • Figure 4: Overview of testing patterns observed in agent frameworks and applications. The three structural patterns (highlighted in blue) and seven verification patterns (highlighted in green) are organized under the top-level testing patterns. Elliptical nodes represent sub-patterns grouped under their corresponding high-level categories.
  • Figure 5: Example DeepEval test case that verifies whether the retrieved output includes only the correct company name. The pattern starts by triggering assert_test, and GEval is invoked subsequently. Evaluation parameters, threshold, and validation steps are configured in the highlighted lines.
  • ...and 5 more figures