SPARC: Scenario Planning and Reasoning for Automated C Unit Test Generation
Jaid Monwar Chowdhury, Chi-An Fu, Reyhaneh Jabbarvand
TL;DR
SPARC targets the persistent semantic gap in automated C unit test generation by combining CFG-based path enumeration with a retrieval-augmented operation map and per-path synthesis. The four-stage pipeline (pre-processing, operation map construction, per-path synthesis, and iterative validation) is designed to produce tests that are semantically grounded, compilable, and traceable to specific execution paths, with iterative repair improving correctness. In an evaluation on 59 real-world and algorithmic C subjects, SPARC outperforms vanilla prompting by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, matches or surpasses KLEE on complex subjects, and retains 94.3% of its tests after repair. A human study reports higher developer-perceived quality in readability, correctness, completeness, and maintainability, and a cost analysis shows that cost-efficient LLMs can match frontier models when used within SPARC's structured pipeline. Overall, SPARC offers a scalable approach to automated C test generation that aligns LLM reasoning with program structure and can be deployed across diverse codebases and LLM budgets.
Abstract
Automated unit test generation for C remains a formidable challenge due to the semantic gap between high-level program intent and the rigid syntactic constraints of pointer arithmetic and manual memory management. While Large Language Models (LLMs) exhibit strong generative capabilities, direct intent-to-code synthesis frequently suffers from the leap-to-code failure mode, where models prematurely emit code without grounding in program structure, constraints, and semantics. This results in non-compilable tests, hallucinated function signatures, low branch coverage, and semantically irrelevant assertions that fail to detect bugs. We introduce SPARC, a neuro-symbolic, scenario-based framework that bridges this gap through four stages: (1) Control Flow Graph (CFG) analysis, (2) an Operation Map that grounds LLM reasoning in validated utility helpers, (3) path-targeted test synthesis, and (4) an iterative self-correction validation loop using compiler and runtime feedback. We evaluate SPARC on 59 real-world and algorithmic subjects, where it outperforms the vanilla prompting baseline by 31.36% in line coverage, 26.01% in branch coverage, and 20.78% in mutation score, and matches or exceeds the symbolic execution tool KLEE on complex subjects. SPARC retains 94.3% of tests through iterative repair and produces code with significantly higher developer-rated readability and maintainability. By aligning LLM reasoning with program structure, SPARC provides a scalable path for industrial-grade testing of legacy C codebases.
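To make the idea of path-targeted synthesis concrete, the following is a minimal illustrative sketch in C, not output produced by SPARC. The function `clamp_sum`, the test names, and the runner `run_path_tests` are hypothetical and chosen only for illustration. The sketch assumes that each feasible path through a function's CFG receives one dedicated test, so that every assertion can be traced back to the path it exercises.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical function under test with three feasible CFG paths. */
int clamp_sum(const int *values, size_t n, int limit)
{
    if (values == NULL || n == 0) {
        return -1;                    /* Path A: invalid input */
    }
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += values[i];
    }
    if (total > limit) {
        return limit;                 /* Path B: sum exceeds limit, clamped */
    }
    return (int)total;                /* Path C: sum returned unchanged */
}

/* One test per path; each name records the path it targets. */
static void test_path_a_null_input(void)
{
    assert(clamp_sum(NULL, 3, 10) == -1);
}

static void test_path_b_sum_clamped(void)
{
    int values[] = {4, 5, 6};         /* 15 > 10, so the limit is returned */
    assert(clamp_sum(values, 3, 10) == 10);
}

static void test_path_c_sum_within_limit(void)
{
    int values[] = {1, 2, 3};         /* 6 <= 10, so the sum is returned */
    assert(clamp_sum(values, 3, 10) == 6);
}

/* Hypothetical runner: executes every path test, returns how many ran. */
int run_path_tests(void)
{
    test_path_a_null_input();
    test_path_b_sum_clamped();
    test_path_c_sum_within_limit();
    return 3;
}
```

Calling `run_path_tests()` from a `main` function runs all three tests; a failing assertion aborts the program and names the test, and therefore the path, that broke. Naming tests after paths is one simple way to obtain the traceability described above; the paper's actual test format may differ.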
