Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency

Jun Yang, Cheng-Chi Wang, Bogdan Alexandru Stoica, Kexin Pei

TL;DR

This work tackles the challenge of evaluating and improving code efficiency in the era of LLM-driven code optimization by introducing WEDGE, a framework that couples contrastive performance constraint reasoning with coverage-guided fuzzing to synthesize and instrument performance-stressing tests. By extracting local, performance-characterizing constraints from contrastive execution traces and guiding fuzzing through constraint-aware mutators and checkers, WEDGE generates inputs that reveal nuanced bottlenecks beyond simple length-stressing. The resulting PerfForge benchmark demonstrates that these tests drive significantly stronger performance stress, improve the reliability of subsequent optimization workflows, and enable fairer comparisons across baseline approaches. The framework is validated on CodeContests problems, with extensive ablations and sensitivity analyses, and released publicly to foster future research in robust, performance-oriented code evaluation and optimization.

Abstract

Large Language Models (LLMs) are increasingly used to optimize code efficiency. Evaluating their effectiveness and suggesting further optimization opportunities often relies on high-quality tests that expose the performance bottlenecks present in a program. However, existing approaches rely on a limited set of hand-curated inputs or on LLM-generated length-stressing tests, which fail to reveal more nuanced optimization opportunities. We present WEDGE, a framework for generating performance-stressing inputs for a given program under test. WEDGE synthesizes explicit performance-characterizing constraints in the form of branch conditions that partition the program's execution space into performance-specific regions. When WEDGE is integrated with a coverage-guided fuzzer, reaching different regions yields explicit rewards that guide test generation to explore inefficient implementations. Our evaluation shows that the tests generated by WEDGE induce a significant slowdown compared to the tests in CodeContests and those claimed to be optimized by existing approaches. From the utility perspective, integrating our tests substantially improves existing code optimization approaches that rely on test-driven execution feedback. We release PERFFORGE, the performance tests generated by WEDGE, to benchmark future approaches for efficient code generation at https://github.com/UChiSeclab/perfforge.
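
To make the notion of a performance-characterizing constraint concrete, here is a minimal sketch. It is not taken from WEDGE or PERFFORGE: the insertion sort standing in for the program under test, the checker `in_slow_region`, and its thresholds are all hypothetical. The point is only that a branch condition can separate two inputs of the same length into a "fast" and a "slow" region, which a pure length-stressing test would not distinguish.

```python
def insertion_sort(values):
    """Toy program under test (assumption): sort a copy of `values`.

    Returns (sorted_list, shifts), where `shifts` counts element moves
    and serves as a simple proxy for execution cost.
    """
    data = list(values)
    shifts = 0
    for i in range(1, len(data)):
        key = data[i]
        j = i - 1
        while j >= 0 and data[j] > key:
            data[j + 1] = data[j]
            j -= 1
            shifts += 1
        data[j + 1] = key
    return data, shifts


def in_slow_region(values, min_len=8, min_descent_ratio=0.75):
    """Hypothetical checker written as a branch condition.

    It holds when the input is long enough AND mostly descending,
    i.e. when the inner loop of insertion_sort does the most work.
    """
    if len(values) < min_len:
        return False
    descents = sum(1 for a, b in zip(values, values[1:]) if a > b)
    return descents / (len(values) - 1) >= min_descent_ratio


if __name__ == "__main__":
    ascending = list(range(16))
    descending = list(range(16, 0, -1))
    for name, data in (("ascending", ascending), ("descending", descending)):
        _, shifts = insertion_sort(data)
        print(name, "slow region:", in_slow_region(data), "shifts:", shifts)
```

Both inputs have 16 elements, yet only the descending one satisfies the checker and triggers the quadratic behavior. In a fuzzing setup, a condition of this kind could be compiled into the program so that reaching its true branch counts as new coverage.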

Paper Structure

This paper contains 41 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Workflow of Wedge. First, our tool profiles the code under test to identify a pair of inputs with contrastive execution profiles ("fast" vs. "slow" execution). Second, with this information, it asks an LLM to infer performance-characterizing constraints and instrument the code with checkers. Third, it runs the instrumented code through a customized fuzzing tool to find performance-stressing inputs.
  • Figure 2: Motivating example from Codeforces (prob. 633A, sol. 622) showing how Wedge reasons about and generates performance-characterizing constraints, and implements the corresponding checkers.
  • Figure 3: A head-to-head comparison between PerfForge and the baseline tests. Each bar represents the number of programs on which one test set incurs a larger number of CPU instructions than the other. The x-axis shows the ratio between the corresponding CPU instruction counts. Since the two EvalPerf variants show similar distributions, we only include EvalPerfslow here (see Section \ref{appx:head-to-head}).
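
The three steps in the Figure 1 caption can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the tool's implementation. The program under test (`trial_division_steps`), the hand-written `checker` (standing in for a constraint an LLM would infer), and the simple mutation loop (standing in for a coverage-guided fuzzer) are all hypothetical.

```python
import random


def trial_division_steps(n):
    """Toy program under test (assumption): naive primality check.

    Returns the number of loop iterations, used as the cost measure.
    """
    steps = 0
    d = 2
    while d * d <= n:
        steps += 1
        if n % d == 0:
            return steps
        d += 1
    return steps


def pick_contrastive_pair(inputs, cost):
    """Step 1: profile the inputs and return the (fast, slow) extremes."""
    ranked = sorted(inputs, key=cost)
    return ranked[0], ranked[-1]


def checker(n):
    """Step 2 (stand-in for an inferred constraint): n has no small divisor."""
    return all(n % p != 0 for p in (2, 3, 5, 7))


def fuzz(seed_input, cost, constraint, rounds=200, rng_seed=0):
    """Step 3: keep a mutant only if it satisfies the constraint
    and is at least as costly as the best input seen so far."""
    rng = random.Random(rng_seed)
    best, best_cost = seed_input, cost(seed_input)
    for _ in range(rounds):
        mutant = max(2, best + rng.randint(-50, 50))
        if constraint(mutant) and cost(mutant) >= best_cost:
            best, best_cost = mutant, cost(mutant)
    return best, best_cost


if __name__ == "__main__":
    fast, slow = pick_contrastive_pair([1000, 1001, 1009, 1024], trial_division_steps)
    print("contrastive pair:", fast, slow)
    print("fuzzed input and cost:", fuzz(slow, trial_division_steps, checker))
```

The inputs 1000 and 1009 are nearly the same size, but the first exits after one iteration while the second, a prime, runs the loop 30 times. The contrastive pair is meant to surface that kind of gap before any constraint is written.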