Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis

Dong Huang, Mingzhe Du, Jie M. Zhang, Zheng Lin, Meng Luo, Qianru Zhang, See-Kiong Ng

TL;DR

Nexus addresses the oracle problem in automated software testing by grounding test-oracle synthesis in execution. It introduces a three-phase pipeline: deliberation by four specialist agents with distinct testing philosophies, validation of the proposed oracles against a plausible LLM-generated implementation of the function under test (FUT) in a sandbox, and iterative self-refinement driven by runtime errors. Across seven benchmarks and multiple models, Nexus outperforms the baselines, with higher oracle accuracy and better results on downstream tasks such as bug detection and automated program repair. The results indicate that execution-grounded reasoning by diverse agents can improve specification-based testing workflows.

Abstract

Test oracle generation in non-regression testing is a longstanding challenge in software engineering, where the goal is to produce oracles that can accurately determine whether a function under test (FUT) behaves as intended for a given input. In this paper, we introduce Nexus, a novel multi-agent framework to address this challenge. Nexus uses a diverse set of specialized agents that synthesize test oracles through a structured process of deliberation, validation, and iterative self-refinement. During the deliberation phase, a panel of four specialist agents, each embodying a distinct testing philosophy, collaboratively critiques and refines an initial set of test oracles. Then, in the validation phase, Nexus generates a plausible candidate implementation of the FUT and executes the proposed oracles against it in a secure sandbox. For any oracle that fails this execution-based check, Nexus activates an automated self-refinement loop, using the specific runtime error to debug and correct the oracle before re-validation. Our extensive evaluation on seven diverse benchmarks demonstrates that Nexus consistently and substantially outperforms state-of-the-art baselines. For instance, Nexus improves the test-level oracle accuracy on LiveCodeBench from 46.30% to 57.73% for GPT-4.1-Mini. The improved accuracy also significantly enhances downstream tasks: with Nexus, the bug detection rate of test oracles generated by GPT-4.1-Mini on HumanEval increases from 90.91% to 95.45% compared to baselines, and the success rate of automated program repair improves from 35.23% to 69.32%.
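To put the reported figures side by side, the short calculation below turns each before/after pair from the abstract into an absolute gain (percentage points) and a relative gain (percent of the baseline value). The numbers are the ones quoted above; the dictionary labels and the `gains` helper are illustrative names, not part of the paper.

```python
# Before/after percentages as quoted in the abstract (GPT-4.1-Mini).
# The labels are shorthand chosen for this sketch.
reported = {
    "LiveCodeBench test-level oracle accuracy": (46.30, 57.73),
    "HumanEval bug detection rate": (90.91, 95.45),
    "Automated program repair success rate": (35.23, 69.32),
}


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = after - before
    relative = 100.0 * absolute / before
    return round(absolute, 2), round(relative, 1)


for label, (before, after) in reported.items():
    points, percent = gains(before, after)
    print(f"{label}: +{points} points ({percent}% relative to baseline)")
```

The absolute gains are 11.43, 4.54, and 34.09 percentage points respectively. The bug detection rate starts from a high baseline, so its relative gain is small, while the repair success rate nearly doubles.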

Paper Structure

This paper contains 48 sections, 1 figure, 7 tables.

Figures (1)

  • Figure 1: The architecture of the Nexus framework. (1) Deliberation: A panel of specialist agents collaborates to produce candidate oracles. (2) Validation: Oracles are validated by executing them against a plausible, LLM-generated implementation of the function under test. (3) Self-Refinement: Failed oracles enter an iterative loop where runtime errors are used for automated debugging. This pipeline grounds abstract deliberation in executable evidence.
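The validation and self-refinement phases in Figure 1 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that an oracle is a Python `assert` statement held as a string, that the candidate FUT is a source string, and that a hypothetical `refine` callable stands in for the LLM that rewrites a failing oracle from its runtime error. The names `run_oracle` and `validate_and_refine` and the `max_rounds` limit are made up for this example, and plain `exec` is used for brevity where a real system would need an isolated sandbox.

```python
from typing import Callable, Optional


def run_oracle(candidate_src: str, oracle: str) -> Optional[str]:
    """Run one oracle against the candidate FUT.

    Returns None if the oracle passes, otherwise the error text.
    NOTE: exec() is NOT a secure sandbox; it is used here only for illustration.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate implementation
        exec(oracle, namespace)         # execute the assertion against it
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def validate_and_refine(
    candidate_src: str,
    oracle: str,
    refine: Callable[[str, str], str],  # hypothetical stand-in for the LLM
    max_rounds: int = 3,                # assumed limit, not from the paper
) -> Optional[str]:
    """Validate an oracle; on failure, pass the error to `refine` and retry."""
    for _ in range(max_rounds):
        error = run_oracle(candidate_src, oracle)
        if error is None:
            return oracle               # oracle survived execution
        oracle = refine(oracle, error)  # debug the oracle from the runtime error
    # Final check of the last refinement; give up if it still fails.
    return oracle if run_oracle(candidate_src, oracle) is None else None


# Toy usage: the oracle passes a string where an int is expected.
candidate = "def add(a, b):\n    return a + b\n"
broken_oracle = "assert add('2', 3) == 5"


def toy_refine(oracle: str, error: str) -> str:
    # A real refiner would prompt a model with `oracle` and `error`.
    return "assert add(2, 3) == 5" if error.startswith("TypeError") else oracle


repaired = validate_and_refine(candidate, broken_oracle, toy_refine)
```

In this toy run the first execution raises a `TypeError`, the stub rewrites the oracle, and the second execution passes. The sketch leaves out the deliberation phase, and it does not address the case where the candidate implementation itself is wrong.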