Multi-Agent Systems for Dataset Adaptation in Software Engineering: Capabilities, Limitations, and Future Directions
Jingyi Chen, Xiaoyan Guo, Songqiang Chen, Shing-Chi Cheung, Jiasi Shen
TL;DR
This paper evaluates state-of-the-art LLM-based multi-agent systems, exemplified by GitHub Copilot's agent mode, on automating the dataset adaptation of SE research artifacts drawn from benchmark repositories including ROCODE and LogHub2.0. Using a five-stage pipeline (read, edit, generate/execute, validate/repair, final execution) and two backends (GPT-4.1 and Claude Sonnet 4), the study finds that current agents reliably identify key files and can generate partial adaptations but seldom produce functionally correct implementations. Prompt-level interventions (providing execution error messages, reference code, and explicit bug-location hints) substantially raise structural similarity to ground truth, from 7.25% to 67.14%, yet end-to-end functional replication remains rare. The results underscore both the potential of self-correcting, execution-aware multi-agent systems and their current limitations, pointing to future directions in adaptive prompting and inter-agent coordination to enable scalable, reproducible SE research.
Abstract
Automating the adaptation of software engineering (SE) research artifacts across datasets is essential for scalability and reproducibility, yet it remains largely unstudied. Recent advances in large language model (LLM)-based multi-agent systems, such as GitHub Copilot's agent mode, promise to automate complex development workflows through coordinated reasoning, code generation, and tool interaction. This paper presents the first empirical study on how state-of-the-art multi-agent systems perform in dataset adaptation tasks. We evaluate Copilot, backed by GPT-4.1 and Claude Sonnet 4, on adapting SE research artifacts from benchmark repositories including ROCODE and LogHub2.0. Through a five-stage evaluation pipeline (file comprehension, code editing, command generation, validation, and final execution), we measure success rates, analyze failure patterns, and assess prompt-based interventions designed to enhance agent performance. Results show that current systems can identify key files and generate partial adaptations but rarely produce functionally correct implementations. Prompt-level interventions, especially providing execution error messages and reference code, substantially improve structural similarity to ground truth (from 7.25% to 67.14%), highlighting the importance of contextual and feedback-driven guidance. Our findings reveal both the promise and limitations of today's multi-agent LLM systems for dataset adaptation, and suggest concrete directions for building more reliable, self-correcting agents in future SE research.
