EigenData: A Self-Evolving Multi-Agent Platform for Function-Calling Data Synthesis, Auditing, and Repair

Jiaao Chen, Jingyuan Qi, Mingye Gao, Wei-Chen Wang, Hanrui Wang, Di Jin

TL;DR

This work applies EigenData, an integrated, self-evolving platform that automates the full data lifecycle through a multi-agent architecture, to audit and repair the Berkeley Function-Calling Leaderboard, and demonstrates that the repaired benchmark produces model rankings substantially better correlated with human judgments of functional correctness.

Abstract

Function-calling agents, large language models that invoke tools and APIs, require high-quality, domain-specific training data spanning executable environments, backing databases, and diverse multi-turn trajectories. We introduce EigenData, an integrated, self-evolving platform that automates the full data lifecycle through a multi-agent architecture. A top-level orchestrator, EigenCore, coordinates three specialized sub-systems: DatabaseAgent for realistic domain database construction, CodingAgent for verified executable environment generation with iterative test-debug loops, and DataAgent for multi-turn trajectory synthesis with self-evolving prompt optimization. Cross-component feedback keeps all artifacts consistent. We apply EigenData to audit and repair the Berkeley Function-Calling Leaderboard (BFCL-V3). The audit identifies systematic errors in function schemas, implementations, and reference trajectories, and the platform corrects them automatically through coordinated schema refinement, code-level bug fixes, and trajectory modification. We also introduce an outcome-aware evaluation protocol that assesses task success via database-state correctness rather than turn-level trajectory matching. We demonstrate that the repaired benchmark, coupled with outcome-aware metrics, produces model rankings substantially better correlated with human judgments of functional correctness.
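
The difference between trajectory matching and outcome-aware scoring can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy tools get_ticket and close_ticket, the in-memory database layout, and the helper names run, trajectory_match, and outcome_match do not come from BFCL or EigenData, and the trajectory check is simplified to whole-sequence equality rather than a per-turn comparison.

```python
import copy

# Hypothetical toy tools operating on an in-memory "database" (a dict).
def get_ticket(db, ticket_id):
    # Read-only lookup: does not change the database state.
    return db["tickets"][ticket_id]

def close_ticket(db, ticket_id):
    # State-changing call: marks the ticket as closed.
    db["tickets"][ticket_id]["status"] = "closed"

TOOLS = {"get_ticket": get_ticket, "close_ticket": close_ticket}

def run(initial_db, calls):
    """Execute (tool_name, kwargs) calls on a copy of the database; return the final state."""
    db = copy.deepcopy(initial_db)
    for name, kwargs in calls:
        TOOLS[name](db, **kwargs)
    return db

def trajectory_match(reference, predicted):
    """Simplified trajectory matching: the two call sequences must be identical."""
    return reference == predicted

def outcome_match(initial_db, reference, predicted):
    """Outcome-aware check: both call sequences must leave the database in the same state."""
    return run(initial_db, reference) == run(initial_db, predicted)

initial_db = {"tickets": {"ticket_001": {"status": "open"}}}
reference = [
    ("get_ticket", {"ticket_id": "ticket_001"}),
    ("close_ticket", {"ticket_id": "ticket_001"}),
]
# The model skips the read-only lookup but still closes the ticket.
predicted = [("close_ticket", {"ticket_id": "ticket_001"})]

print(trajectory_match(reference, predicted))           # False
print(outcome_match(initial_db, reference, predicted))  # True
```

In this sketch the predicted trajectory omits a call that has no effect on state, so sequence matching rejects it while the state comparison accepts it. That is the kind of functionally correct behavior an outcome-aware metric is meant to credit.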

Paper Structure

This paper contains 61 sections, 15 figures, 3 tables, and 1 algorithm.

Figures (15)

  • Figure 1: Overview of the EigenData platform. EigenCore receives user requests and orchestrates three specialized sub-systems: DatabaseAgent for domain database construction, CodingAgent for environment and tool generation, and DataAgent for multi-turn trajectory synthesis. Artifacts flow left to right through the canonical pipeline, with cross-component feedback enabling targeted repair without full-pipeline restarts.
  • Figure 2: Internal architecture of the DatabaseAgent. The pipeline proceeds from input specification through schema design, validation, data population (with constraint-aware sampling, distribution modeling, and edge-case injection), and consistency verification. A targeted regeneration loop handles any constraint violations detected during verification.
  • Figure 3: Internal architecture of the DataAgent. The Orchestration Layer (top) coordinates planning, prompt engineering, and quality evaluation through a self-optimizing feedback loop. The Execution Layer (bottom) comprises specialized worker agents organized into three stages: seed and plan generation, dialogue synthesis, and processing and validation. A two-phase self-evolving process first optimizes prompts on a small batch (Phase 1), then scales to full production with continuous monitoring (Phase 2).
  • Figure 4: Representative workflows instantiated by the DataAgent's Workflow Planner. Each row shows a distinct pipeline: Data Audit systematically diagnoses quality issues and failure modes; Schema Polish iteratively refines API specifications via verification-modification loops; Data Repair performs targeted trajectory repair by segmenting conversations into safe, affected, and candidate zones; Data Generation Without Tool Graphs produces multi-turn conversations top-down from sampled user intents; Data Generation With Tool Graphs generates grounded conversations bottom-up from executed tool-call plans; Schema-Triggered Patch incrementally reconstructs only the trajectory portions affected by schema evolution.
  • Figure 5: Example of a function schema error in BFCL. The schema declares ticket_id as integer, but the ground truth and backing data use the string "ticket_001", so a model that follows the schema cannot succeed.
  • ...and 10 more figures
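
The kind of schema error described in the Figure 5 caption can be detected mechanically by comparing the declared parameter types against the argument values in a reference call. The following is a minimal sketch under stated assumptions, not the paper's audit procedure: it assumes a JSON-Schema-style layout for the function schema, and the function name close_ticket and the helper find_type_mismatches are hypothetical. Only the parameter ticket_id, its declared type integer, and the value "ticket_001" come from the caption.

```python
# Map JSON-Schema-style type names to Python types (assumed schema layout).
JSON_TYPES = {
    "integer": int,
    "number": (int, float),
    "string": str,
    "boolean": bool,
    "array": list,
    "object": dict,
}

def find_type_mismatches(schema, call_args):
    """Return (parameter, declared_type, actual_type) for each argument that violates the schema."""
    properties = schema["parameters"]["properties"]
    mismatches = []
    for name, value in call_args.items():
        declared = properties.get(name, {}).get("type")
        expected = JSON_TYPES.get(declared)
        if expected is None:
            continue  # parameter is undeclared or has an unknown type; skipped in this sketch
        ok = isinstance(value, expected)
        # bool is a subclass of int in Python, so reject it for numeric types explicitly.
        if declared in ("integer", "number") and isinstance(value, bool):
            ok = False
        if not ok:
            mismatches.append((name, declared, type(value).__name__))
    return mismatches

# Hypothetical schema mirroring the error in the Figure 5 caption.
schema = {
    "name": "close_ticket",  # hypothetical function name
    "parameters": {
        "type": "object",
        "properties": {"ticket_id": {"type": "integer"}},
    },
}
ground_truth_args = {"ticket_id": "ticket_001"}

print(find_type_mismatches(schema, ground_truth_args))
# [('ticket_id', 'integer', 'str')]
```

A check like this only flags the disagreement. Deciding whether the schema or the reference data should change still requires looking at the backing data, which in the Figure 5 example uses string identifiers.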