Nexus: Execution-Grounded Multi-Agent Test Oracle Synthesis
Dong Huang, Mingzhe Du, Jie M. Zhang, Zheng Lin, Meng Luo, Qianru Zhang, See-Kiong Ng
TL;DR
Nexus addresses the oracle problem in automated software testing by grounding test-oracle synthesis in execution. It introduces a three-phase pipeline—deliberation by four orthogonal agents, validation against a plausible LLM-generated FUT in a sandbox, and iterative self-refinement using runtime errors. Across seven benchmarks and multiple models, Nexus substantially outperforms baselines, delivering higher oracle accuracy and improved downstream tasks like bug detection and automated program repair. This approach demonstrates that execution-grounded, diverse-agent reasoning can significantly enhance specification-based testing workflows.
Abstract
Test oracle generation in non-regression testing is a longstanding challenge in software engineering, where the goal is to produce oracles that can accurately determine whether a function under test (FUT) behaves as intended for a given input. In this paper, we introduce Nexus, a novel multi-agent framework to address this challenge. Nexus generates test oracles by leveraging a diverse set of specialized agents that synthesize test oracles through a structured process of deliberation, validation, and iterative self-refinement. During the deliberation phase, a panel of four specialist agents, each embodying a distinct testing philosophy, collaboratively critiques and refines an initial set of test oracles. Then, in the validation phase, Nexus generates a plausible candidate implementation of the FUT and executes the proposed oracles against it in a secure sandbox. For any oracle that fails this execution-based check, Nexus activates an automated selfrefinement loop, using the specific runtime error to debug and correct the oracle before re-validation. Our extensive evaluation on seven diverse benchmarks demonstrates that Nexus consistently and substantially outperforms state-of-theart baselines. For instance, Nexus improves the test-level oracle accuracy on the LiveCodeBench from 46.30% to 57.73% for GPT-4.1-Mini. The improved accuracy also significantly enhances downstream tasks: the bug detection rate of GPT4.1-Mini generated test oracles on HumanEval increases from 90.91% to 95.45% for Nexus compared to baselines, and the success rate of automated program repair improves from 35.23% to 69.32%.
