Collaborative Evolving Strategy for Automatic Data-Centric Development

Xu Yang; Haotian Chen; Wenjun Feng; Haoxue Wang; Zeqi Ye; Xinjie Shen; Xiao Yang; Shizhao Sun; Weiqing Liu; Jiang Bian

TL;DR

This work identifies automated data-centric development as a key but underexplored challenge in accelerating data-driven science. It introduces Co-STEER, an LLM-based autonomous agent composed of a scheduling module and an implementation module that co-evolve through practice, drawing on a growing practical knowledge base and feedback retrieval. In experiments on the RD2Bench AD^2 benchmark, Co-STEER substantially outperforms state-of-the-art natural-language-to-code baselines in implementation accuracy (avg. corr.), format adherence, and execution success, and the scheduling component further improves performance. The results indicate that collaborative evolution between scheduling and implementation, guided by feedback and a transferable knowledge base, can significantly advance automated data-centric development, although it requires high-quality data and computational resources.

Abstract

Artificial Intelligence (AI) significantly influences many fields, largely thanks to the vast amounts of high-quality data available to machine learning models. The emphasis is now on a data-centric AI strategy that prioritizes data development over progress in model design. Automating this process is crucial. In this paper, we are the first to introduce the automatic data-centric development (AD^2) task and outline its core challenges, which require domain-expert-like task scheduling and implementation capabilities that previous work has left largely unexplored. Leveraging the strong complex problem-solving capabilities of large language models (LLMs), we propose an LLM-based autonomous agent, equipped with a strategy named Collaborative Knowledge-STudying-Enhanced Evolution by Retrieval (Co-STEER), to address all of these challenges simultaneously. Specifically, the Co-STEER agent enriches its domain knowledge through our proposed evolving strategy and develops both its scheduling and implementation skills by accumulating and retrieving domain-specific practical experience. With an improved schedule, implementation capability improves faster. At the same time, as implementation feedback becomes more thorough, scheduling accuracy increases. These two capabilities evolve together through practical feedback, enabling a collaborative evolution process. Extensive experimental results demonstrate that our Co-STEER agent breaks new ground in AD^2 research, possesses strong evolvable scheduling and implementation abilities, and that each of its components is effective. Co-STEER paves the way for AD^2 advancements.
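The feedback loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the names `KnowledgeBase`, `schedule`, `implement`, and `co_evolve` are hypothetical, the LLM-based implementation step is replaced by a deterministic stand-in, and the scheduling rule (prefer tasks with fewer recorded failures) is an assumption chosen only to show how scheduling and implementation might inform each other through a shared store of practical feedback.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Hypothetical store of past outcomes per task (True = success)."""
    records: dict[str, list[bool]] = field(default_factory=dict)

    def add(self, task: str, succeeded: bool) -> None:
        self.records.setdefault(task, []).append(succeeded)

    def attempts(self, task: str) -> int:
        return len(self.records.get(task, []))

    def failures(self, task: str) -> int:
        return sum(1 for ok in self.records.get(task, []) if not ok)


def schedule(tasks: list[str], kb: KnowledgeBase, budget: int) -> list[str]:
    """Pick up to `budget` tasks, preferring those with fewer recorded failures.

    sorted() is stable, so tied tasks keep their original order.
    """
    return sorted(tasks, key=kb.failures)[:budget]


def implement(task: str, difficulty: dict[str, int], kb: KnowledgeBase) -> bool:
    """Stand-in for LLM code generation (an assumption, not the paper's method).

    A task succeeds once the knowledge base holds at least `difficulty[task]`
    earlier attempts at it, mimicking learning from retrieved experience.
    """
    return kb.attempts(task) >= difficulty[task]


def co_evolve(difficulty: dict[str, int], rounds: int, budget: int):
    """Alternate scheduling and implementation, feeding results back each round."""
    kb = KnowledgeBase()
    pending = list(difficulty)
    done: list[str] = []
    for _ in range(rounds):
        for task in schedule(pending, kb, budget):
            ok = implement(task, difficulty, kb)
            kb.add(task, ok)  # implementation feedback informs later schedules
            if ok:
                done.append(task)
                pending.remove(task)
    return done, kb


# Toy usage: three tasks with assumed difficulties, two attempts per round.
done, kb = co_evolve({"a": 0, "b": 1, "c": 3}, rounds=3, budget=2)
print(done)  # ['a', 'b']
```

In this toy setting the hardest task remains unfinished after three rounds and is completed once more rounds are allowed, which mirrors, in a very simplified way, the idea that accumulated practical experience improves later implementation attempts.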

Paper Structure

This paper contains 28 sections, 7 figures, 19 tables.

Introduction
Related Work
Agent Workflows
Agents in Related Scenarios
Co-STEER Agent
Problem Formulation
Overall Design
Scheduling Agent
Implementation Agent
Knowledge Base Design
Feedback Design
Experiments
Datasets
Experimental Settings
Baselines
...and 13 more sections

Figures (7)

Figure 1: A brief illustration of AD^2. An agent is expected to understand both the current method and candidate data sources for data selection and preprocessing in the engineering logic expression step.
Figure 2: The detailed design of Co-STEER involves two agents. The scheduling agent takes candidate tasks as input and iteratively makes schedules based on various factors; the implementation agent learns from practice and builds a fine-grained practical knowledge base that can transfer between tasks; the two agents evolve through mutual support.
Figure 3: Visualization of Co-STEER progress.
Figure 4: The scheduling agent learns to schedule based on task information and practical feedback.
Figure 5: Scheduling agent response: task complexity and task dependency are among the multiple factors considered when prioritizing tasks.
...and 2 more figures
