DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task

Shafiuddin Rehan Ahmed; Wei Wei

TL;DR

DDL2PropBank introduces a principled benchmark for evaluating the developer experience (DX) of multi-agent frameworks through a task that maps relational schemas to PropBank rolesets. It uses an Agent-as-a-Tool architecture to implement identical agent logic across 10 frameworks, and evaluates code complexity via static analysis and AI-assistability via AI-generated implementations produced with a Copilot workflow and checked by runtime testing. Results reveal a three-tier complexity spectrum and show that Agno combines the lowest complexity with the highest structural alignment and a pass@1 of 83%. Structural alignment is predictive of runtime success for single-canonical-pattern frameworks but overestimates correctness for multi-pattern frameworks; overall, Agno is strongest for idiomatic AI-assisted development. The work offers design guidance for framework developers aiming to improve DX and outlines future extensions to broaden semantic grounding and applicability.

Abstract

Multi-agent frameworks promise to simplify LLM-driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability, the extent to which LLMs can autonomously generate correct, framework-specific code. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI-assistability, structural alignment scores reliably proxy runtime success for frameworks with single canonical patterns, but overestimate correctness for multi-pattern frameworks. Agno emerges as the strongest overall performer, combining the lowest complexity with the highest structural alignment and 83% pass@1.

Paper Structure (102 sections, 10 figures, 10 tables, 3 algorithms)

Introduction
DDL2PropBank Agent
Task Definition
Input.
Output.
Reference Agent-as-a-Tool Architecture
Agent Descriptions
Orchestrator.
Coordinator.
Table Mapper.
MCP Integration
PropBank MCP Server (StreamableHTTP).
Filesystem MCP Server (StdIO).
Get Action Verbs function tool.
Ground-truth Implementations
...and 87 more sections

Figures (10)

Figure 1: Top: The Agent-as-a-Tool architecture—an Orchestrator invokes a Coordinator and parallel Table Mapper agents, all accessing shared MCP servers (filesystem and PropBank). Middle: We implement identical logic across 10 MAFs including Claude SDK (Anthropic), Agents SDK (OpenAI), and 8 other open-source frameworks. Bottom: Dual-dimensional developer experience benchmarking.
Figure 2: AI-assisted implementation workflow. Each framework evaluation uses identical inputs: a query template and project context. The assistant reads the locked files and implements database_mapper.py (*).
Figure 3: Project structure for AI-assisted implementation. Locked files are read-only context; the assistant implements only database_mapper.py (*).
Figure 4: Framework comparison across judge score and pass@1 dimensions. The upper-right quadrant represents the sweet spot: frameworks where AI assistants reliably generate both structurally aligned and functionally correct implementations. Dashed lines indicate median score (15.9) and 50% pass@1 threshold.
Figure 5: Orchestrator
...and 5 more figures
