Testing and Enhancing Multi-Agent Systems for Robust Code Generation

Zongyi Lyu, Songqiang Chen, Zhenlan Ji, Liwen Wang, Shuai Wang, Daoyuan Wu, Wenxuan Wang, Shing-Chi Cheung

TL;DR

This work investigates the robustness of multi-agent systems (MASs) used for automatic code generation, revealing that semantic-preserving perturbations of user requests cause substantial performance drops on problems the systems initially solved. The authors design a fuzzing framework with semantic-preserving mutations and a dual fitness function that accounts for both planning and coding artifacts, exposing a pervasive planner-coder gap as the primary source of failures. They introduce a repairing method that combines multi-prompt generation with a monitor agent to improve inter-agent communication and interpretation, repairing up to 88.9% of previously identified failures and reducing the failures found in subsequent fuzzing by up to 85.7%. The study provides actionable guidance for building more reliable MAS-based code generation systems and establishes a foundation for ongoing robustness research in multi-agent code generation.

Abstract

Multi-agent systems (MASs) have emerged as a promising paradigm for automated code generation, demonstrating impressive performance on established benchmarks by decomposing complex coding tasks across specialized agents with different roles. Despite their rapid development and adoption, their robustness remains largely unexplored, raising critical concerns for real-world deployment. This paper presents the first comprehensive study of the robustness of MASs for code generation, using a fuzzing-based testing approach. We design a fuzzing pipeline that incorporates semantic-preserving mutation operators and a novel fitness function, and use it to assess mainstream MASs across multiple datasets and LLMs. Our findings reveal substantial robustness flaws in various popular MASs: after semantic-preserving mutations are applied, they fail to solve 7.9%-83.3% of the problems they initially solved. Through comprehensive failure analysis, we identify a common yet largely overlooked cause of the robustness issue: miscommunication between planning and coding agents, where plans lack sufficient detail and coding agents misinterpret intricate logic, consistent with the challenges inherent in a multi-stage information transformation process. To address this issue, we propose a repairing method that combines multi-prompt generation with a new monitor agent. Evaluation shows that our repairing method effectively enhances the robustness of MASs, resolving 40.0%-88.9% of the identified failures. Our work uncovers critical robustness flaws in MASs and provides effective mitigation strategies, contributing essential insights for developing more reliable MASs for code generation.
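
To make the fuzzing idea in the abstract concrete, the following is a minimal sketch of a fitness-guided robustness fuzzing loop. It is not the authors' implementation: the two mutation operators, the function names (fuzz, run_mas, passes_tests, fitness), and the parent-selection rule are all assumptions chosen for illustration, and the paper's actual operators and fitness function are not given in this summary. The sketch only assumes that the MAS returns both a plan and code, so that a fitness function can score both artifacts.

```python
import random
from typing import Callable, List, Tuple


# Toy semantic-preserving mutation operators (hypothetical; the paper's
# actual operators are not listed in this summary).
def add_politeness(prompt: str) -> str:
    return "Please " + prompt[0].lower() + prompt[1:]


def append_restatement(prompt: str) -> str:
    return prompt + " Return the result as described."


MUTATORS: List[Callable[[str], str]] = [add_politeness, append_restatement]


def fuzz(
    seed_prompt: str,
    run_mas: Callable[[str], Tuple[str, str]],  # prompt -> (plan, code)
    passes_tests: Callable[[str], bool],        # code -> test verdict
    fitness: Callable[[str, str], float],       # (plan, code) -> score
    budget: int = 20,
    seed: int = 0,
) -> List[str]:
    """Return the mutated prompts on which the MAS no longer passes.

    Assumes the MAS already solves seed_prompt.
    """
    rng = random.Random(seed)
    # Pool of (prompt, fitness) pairs that still pass; the seed starts at 0.
    pool: List[Tuple[str, float]] = [(seed_prompt, 0.0)]
    failures: List[str] = []
    for _ in range(budget):
        # Assumed selection rule: mutate the highest-fitness passing prompt.
        parent, _score = max(pool, key=lambda item: item[1])
        mutant = rng.choice(MUTATORS)(parent)
        plan, code = run_mas(mutant)
        if passes_tests(code):
            # Score both artifacts so promising mutants guide later rounds.
            pool.append((mutant, fitness(plan, code)))
        else:
            failures.append(mutant)
    return failures
```

In use, run_mas would wrap a real planner-coder pipeline and passes_tests would execute the benchmark's test cases; any prompt returned in the failure list is a semantically equivalent request that the system previously solved but now fails.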

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Pipeline of MASs for code generation.
  • Figure 2: Overall workflow of our repairing method. Multi-prompt generation produces several semantically equivalent versions of the user input. The monitor agent performs plan interpretation and code checking to improve communication between planner and coder.
  • Figure 3: Prompt for the monitor agent. Blue describes the task, brown echoes the five EPs in Sec. \ref{sec:implication}, while green provides the I/O format and examples.
  • Figure 4: Comparison of repairing performance when removing different components.
  • Figure 5: Comparison of failures of the original and repaired MASs found in fuzzing.
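
The repair workflow described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative reading of the caption, not the authors' code: the function repair_run, its callable parameters, and the way the two components are combined (try each semantically equivalent prompt in turn and accept the first code that passes the monitor's check) are assumptions.

```python
from typing import Callable, List, Optional


def repair_run(
    request: str,
    paraphrase: Callable[[str, int], List[str]],  # multi-prompt generation
    planner: Callable[[str], str],                # prompt -> plan
    interpret_plan: Callable[[str, str], str],    # monitor: (prompt, plan) -> detailed plan
    coder: Callable[[str], str],                  # detailed plan -> code
    check_code: Callable[[str, str], bool],       # monitor: (detailed plan, code) -> verdict
    n_variants: int = 3,
) -> Optional[str]:
    """Return the first code accepted by the monitor, or None."""
    # Original request plus semantically equivalent rewrites of it.
    prompts = [request] + paraphrase(request, n_variants)
    for prompt in prompts:
        plan = planner(prompt)
        # The monitor expands the plan so the coder receives enough detail.
        detailed_plan = interpret_plan(prompt, plan)
        code = coder(detailed_plan)
        # The monitor checks that the code matches the interpreted plan.
        if check_code(detailed_plan, code):
            return code
    return None
```

Passing each agent in as a callable keeps the sketch independent of any particular MAS framework or LLM client; in practice each callable would wrap a model call.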