DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task

Shafiuddin Rehan Ahmed, Wei Wei

TL;DR

DDL2PropBank introduces a principled benchmark for evaluating the developer experience (DX) of multi-agent frameworks, built on the task of mapping relational schemas to PropBank rolesets. It uses an Agent-as-a-Tool architecture to implement identical agent logic across 10 frameworks, then evaluates code complexity via static analysis and AI-assistability via AI-generated implementations, measured with a Copilot workflow and runtime testing. Results reveal a three-tier complexity spectrum and show that Agno combines the lowest complexity with the highest structural alignment and a pass@1 of 83%. Structural alignment predicts runtime success for frameworks with a single canonical pattern but overestimates correctness for multi-pattern frameworks; overall, Agno is the strongest choice for idiomatic AI-assisted development. The work offers design guidance for framework developers aiming to improve DX and outlines future extensions to broaden semantic grounding and applicability.

Abstract

Multi-agent frameworks promise to simplify LLM-driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability -- the extent to which LLMs can autonomously generate correct, framework-specific code. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI-assistability, structural alignment scores reliably proxy runtime success for frameworks with single canonical patterns, but overestimate correctness for multi-pattern frameworks. Agno emerges as the strongest overall performer, combining lowest complexity with highest structural alignment and 83% pass@1.
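
The Agent-as-a-Tool pattern named in the abstract can be illustrated with a minimal, framework-free sketch. Everything below is an assumption made for illustration: the function names, the division of labor between the coordinator and the table mappers, and the placeholder output strings are hypothetical, and no LLM or MCP server is called. It shows only the control flow, in which an orchestrator invokes sub-agents as ordinary awaitable tools and runs the per-table mappers in parallel.

```python
import asyncio


async def coordinator(ddl_tables: list[str]) -> list[str]:
    """Hypothetical coordinator agent: decides which tables need mapping.

    Stubbed here to return every table in sorted order; a real agent
    would reason over the schema with a model.
    """
    await asyncio.sleep(0)  # stands in for model latency
    return sorted(ddl_tables)


async def table_mapper(table: str) -> tuple[str, str]:
    """Hypothetical table-mapper agent: maps one table to a roleset.

    Returns a placeholder string instead of a real PropBank roleset.
    """
    await asyncio.sleep(0)  # stands in for model and tool latency
    return table, f"<roleset for {table}>"


async def orchestrator(ddl_tables: list[str]) -> dict[str, str]:
    """Orchestrator: calls other agents exactly as it would call tools."""
    # Step 1: invoke the coordinator as a tool and await its result.
    plan = await coordinator(ddl_tables)
    # Step 2: fan out one table-mapper call per table, run concurrently.
    pairs = await asyncio.gather(*(table_mapper(t) for t in plan))
    return dict(pairs)


mapping = asyncio.run(orchestrator(["orders", "customers"]))
print(mapping)
```

Because each sub-agent is just an awaitable callable, the same orchestration logic can in principle be re-expressed in any framework that lets one agent be registered as a tool of another, which is what makes a like-for-like comparison across frameworks possible.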

Paper Structure (102 sections, 10 figures, 10 tables, 3 algorithms)

Figures (10)

  • Figure 1: Top: The Agent-as-a-Tool architecture—an Orchestrator invokes a Coordinator and parallel Table Mapper agents, all accessing shared MCP servers (filesystem and PropBank). Middle: We implement identical logic across 10 MAFs including Claude SDK (Anthropic), Agents SDK (OpenAI), and 8 other open-source frameworks. Bottom: Dual-dimensional developer experience benchmarking.
  • Figure 2: AI-assisted implementation workflow. Each framework evaluation uses identical inputs: a query template and project context. The assistant reads the locked files and implements database_mapper.py (*).
  • Figure 3: Project structure for AI-assisted implementation. Locked files are read-only context; the assistant implements only database_mapper.py (*).
  • Figure 4: Framework comparison across judge score and pass@1 dimensions. The upper-right quadrant represents the sweet spot: frameworks where AI assistants reliably generate both structurally aligned and functionally correct implementations. Dashed lines indicate median score (15.9) and 50% pass@1 threshold.
  • Figure 5: Orchestrator
  • ...and 5 more figures