Condor: A Code Discriminator Integrating General Semantics with Code Details

Qingyuan Liang, Zhao Zhang, Chen Liu, Zeyu Sun, Wenjie Zhang, Yizhou Chen, Zixiao Zhao, Qi Luo, Wentao Wang, Yanjie Jiang, Yingfei Xiong, Lu Zhang

TL;DR

Large language models often fail to produce correct code on the first attempt for complex tasks. Condor is a non-execution-based discriminator that combines embedding-level contrastive learning with data-level augmentation from intermediate edits to detect subtle code differences. It is evaluated on the CodeNanoFix dataset and on the APPS, MBPP, and LiveCodeBench benchmarks. On CodeNanoFix, Condor (1.3B) raises the discriminative F1 score of DeepSeek-Coder (1.3B) from 67% to 73% and the Pass@1 of Meta-Llama-3.1-Instruct (70B) from 52.64% to 62.63%, and ablations confirm the value of both proposed strategies. The result is a practical, flexible discriminator that improves the reliability of LLM-based code generation without requiring an execution environment.

Abstract

LLMs demonstrate significant potential across various software engineering tasks. However, they still face challenges in generating correct code on the first attempt when addressing complex requirements. Introducing a discriminator to select reliable outputs from multiple generated results is an effective way to enhance their reliability and stability. Currently, these discriminators fall into two categories: execution-based discriminators and non-execution-based discriminators. Execution-based discriminators face flexibility challenges due to difficulties in obtaining test cases and security concerns, while non-execution-based discriminators, although more flexible, struggle to capture subtle differences in code details. To maintain flexibility while improving the model's ability to capture fine-grained code details, this paper proposes Condor. We first design contrastive learning to optimize the code representations of the base model, enabling it to reflect differences in code details. Then, we leverage intermediate data from the code modification process to further enrich the discriminator's training data, enhancing its ability to discern code details. Experimental results indicate that on the subtle code difference dataset (i.e., CodeNanoFix), Condor significantly outperforms other discriminators in discriminative performance: Condor (1.3B) improves the discriminative F1 score of DeepSeek-Coder (1.3B) from 67% to 73%. In discriminating LLM-generated outputs, Condor (1.3B) and Condor (110M) raise the Pass@1 score of Meta-Llama-3.1-Instruct (70B) on the CodeNanoFix dataset from 52.64% to 62.63% and 59.64%, respectively. Moreover, Condor demonstrates strong generalization capabilities on the APPS, MBPP, and LiveCodeBench datasets. For example, Condor (1.3B) improves the Pass@1 of Meta-Llama-3.1-Instruct (70B) on the APPS dataset by 147.05%.
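The selection step described in the abstract, in which a discriminator picks one output from several generated candidates, can be sketched as follows. This is a minimal illustration and not the authors' implementation: `select_with_discriminator` and `toy_score` are hypothetical names, and `toy_score` is a trivial stand-in for a trained discriminator such as Condor, which would return a learned correctness score instead.

```python
from typing import Callable, Sequence


def select_with_discriminator(
    problem: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate that the discriminator scores highest.

    score_fn(problem, code) is assumed to return a higher value for code
    that is more likely to be correct. No candidate is executed.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() keeps the first maximal element, so ties go to the earliest sample.
    return max(candidates, key=lambda code: score_fn(problem, code))


def toy_score(problem: str, code: str) -> float:
    """Hypothetical stand-in for a trained discriminator (NOT Condor)."""
    return 1.0 if "return a + b" in code else 0.0


# Two candidates that differ in a single operator, the kind of subtle
# difference the abstract says non-execution-based discriminators struggle with.
candidates = [
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return a + b",
]
best = select_with_discriminator("Add two integers.", candidates, toy_score)
```

Because the selected candidate is the one reported as the model's single answer, the quality of `score_fn` directly determines Pass@1 after selection.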

Paper Structure

This paper contains 42 sections, 13 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: An example illustrating how humans arrive at correct code through reasoning, and how models rely on a discriminator to select the correct code. The upper section shows the interaction with the code evaluation system, where the user submits code twice. The lower section shows a discriminator selecting the correct code: because the model may not generate the correct answer on its first attempt, a discriminator is needed to make the generated outputs more reliable.
  • Figure 2: Overview of Condor, which consists of two main components: contrastive learning at the embedding level to capture code details (upper section), and data-level augmentation with intermediate code, which supplies code details that existing datasets do not record (lower section). 'C' denotes correct code that passes all test cases, while 'E' denotes erroneous code that fails some test cases.
  • Figure 3: Case studies showing Condor’s ability to recognize correct solutions and detect errors in code.
  • Figure 4: Illustration of the impact of contrastive learning on code representations. The first three subplots show the 2D embeddings of code after training for 1 epoch (top left), 5 epochs (top right), and 40 epochs (bottom left). The bottom right subplot illustrates the changes in the average distance between correct code snippets and between error and correct code snippets as the number of training epochs increases.
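The two quantities tracked in the last subplot of Figure 4 (the average distance among correct snippets, and the average distance between erroneous and correct snippets) can be computed along the following lines. This is a sketch under assumptions: the caption does not name the distance metric, so Euclidean distance is assumed here, and the 2D points are made-up toy values rather than embeddings from the paper.

```python
import math
from itertools import combinations, product
from statistics import mean


def avg_intra_distance(points):
    """Mean Euclidean distance over all unordered pairs within one group."""
    return mean(math.dist(a, b) for a, b in combinations(points, 2))


def avg_cross_distance(group_a, group_b):
    """Mean Euclidean distance over all pairs drawn from two groups."""
    return mean(math.dist(a, b) for a, b in product(group_a, group_b))


# Toy 2D embeddings (hypothetical values, for illustration only).
correct = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
error = [(3.0, 4.0)]

correct_to_correct = avg_intra_distance(correct)
error_to_correct = avg_cross_distance(error, correct)
```

With these toy points, the erroneous snippet lies farther from the correct ones than the correct ones lie from each other, which is the kind of separation a contrastive objective is meant to encourage.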