Functional Overlap Reranking for Neural Code Generation

Hung Quoc To; Minh Huynh Nguyen; Nghi D. Q. Bui

TL;DR

SRank reranks candidate code solutions by modeling the functional overlap between execution-based clusters. It groups solutions whose execution outputs are identical, computes an inter-cluster interaction matrix $\mathbf{I}$, and ranks clusters by the final score $\mathbf{R} = \mathbf{I} \cdot \mathbf{V}$, where $\mathbf{V}$ encodes cluster features. Across multiple CodeLLMs and benchmarks (HumanEval, MBPP-S, APPS), SRank consistently outperforms state-of-the-art methods such as CodeT and Coder-Reviewer in pass@1, and it remains robust when only a limited number of samples is available. The work highlights the practical value of selecting correct code under resource constraints and discusses future work on multilingual support and efficiency improvements.
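
The pipeline described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, the interaction entry is assumed to be the fraction of test inputs on which two clusters produce the same output, and the feature vector $\mathbf{V}$ is assumed to be the cluster size, which is only one possible choice of cluster feature.

```python
from collections import defaultdict


def cluster_by_outputs(outputs):
    """Group solution indices whose outputs are identical on every test input."""
    clusters = defaultdict(list)
    for idx, out in enumerate(outputs):
        clusters[tuple(out)].append(idx)
    # Each item is (output signature, [indices of solutions in the cluster]).
    return list(clusters.items())


def interaction_matrix(signatures):
    """Assumed definition: I[i][j] = fraction of test inputs where clusters i and j agree."""
    num_tests = len(signatures[0])
    return [
        [sum(a == b for a, b in zip(sig_i, sig_j)) / num_tests for sig_j in signatures]
        for sig_i in signatures
    ]


def rank_clusters(outputs):
    """Rank clusters by R = I . V, with V assumed to be the cluster sizes."""
    clusters = cluster_by_outputs(outputs)
    signatures = [sig for sig, _ in clusters]
    I = interaction_matrix(signatures)
    V = [len(members) for _, members in clusters]  # assumption: feature = cluster size
    R = [sum(I[i][j] * V[j] for j in range(len(V))) for i in range(len(V))]
    order = sorted(range(len(R)), key=lambda i: R[i], reverse=True)
    return [(clusters[i][1], R[i]) for i in order]


# Four sampled solutions, each executed on the same four test inputs (made-up outputs).
sampled_outputs = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [11, 20, 30, 40],
    [10, 20, 30, 41],
]
ranked = rank_clusters(sampled_outputs)
print(ranked[0])  # members and score of the top-ranked cluster
```

In this toy input the two identical solutions form the largest cluster, and that cluster also agrees with each of the other clusters on three of four outputs, so it receives the highest score under these assumptions.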

Abstract

Code Large Language Models (CodeLLMs) have ushered in a new era of code generation. However, selecting the best solution from all possible CodeLLM outputs remains a challenge. Previous methods often overlook the intricate functional similarities and interactions between solution clusters. We introduce SRank, a novel reranking strategy for selecting the best solutions from code generation that focuses on modeling the relationships between clusters of solutions. By quantifying the functional overlap between solution clusters, our approach provides a better ranking strategy for code solutions. Empirical results show that our method achieves strong pass@1 scores. For instance, on the HumanEval benchmark, we achieve 69.66% in pass@1 with Codex002, 75.31% with WizardCoder, 53.99% with StarCoder, and 60.55% with CodeGen, surpassing state-of-the-art code generation reranking methods such as CodeT and Coder-Reviewer on the same CodeLLM by a significant margin (approximately 6.1% improvement on average). Even with a limited number of sampled solutions and test cases, our approach remains robust and superior, setting a new state of the art in code generation reranking. Our implementation can be found at https://github.com/FSoft-AI4Code/SRank-CodeRanker.

Paper Structure (39 sections, 7 equations, 20 figures, 7 tables)

Introduction
Background & Motivation
Code Generation
Solution Clustering and Reranking
Modeling Inter-Cluster Relationships
Approach Details
Overview
Clustering Solutions by Execution Outputs
Computing Interaction Matrix
Computing Final Ranking Scores
Experimental Setup
Models
Metrics
Baselines
Benchmarks
...and 24 more sections

Figures (20)

Figure 1: Illustration of the concept of "functional overlap" among clusters of solutions. Cluster 1 outputs [10,20,30,40] and Cluster 2 outputs [11,20,30,40]. Cluster 3's output matches Cluster 1 on the first three values [10,20,30] and differs on the fourth. Cluster 1 therefore overlaps Cluster 2 on three values [20,30,40] and Cluster 3 on three values [10,20,30], a 3/4 overlap in each case, giving Cluster 1 a functional overlap score of 3 + 3 = 6. Cluster 2 overlaps Cluster 3 on two values [20,30], so Cluster 2 has a functional overlap score of 3 + 2 = 5, and Cluster 3 likewise has a score of 5. Cluster 1 thus has the highest cumulative functional overlap, making it the most representative cluster and the one most likely to be the optimal solution.
Figure 2: Method overview.
Figure 3: Probability of incorrect solutions varied based on the degree of functional agreement on HumanEval.
Figure 4: Ablation study on scaling number of model generated test cases vs. pass@1 on HumanEval.
Figure 5: Ablation study on scaling number of sampled solutions vs. pass@1 on HumanEval.
...and 15 more figures
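
The arithmetic in the Figure 1 caption can be checked with a short Python sketch. It assumes that overlap is counted position by position and that a cluster's score sums its overlap with every other cluster. The last value of Cluster 3 is a hypothetical placeholder (41): the caption's overlap counts only require that it differ from 40.

```python
# Execution outputs per cluster on four test inputs.
cluster_outputs = {
    "Cluster 1": [10, 20, 30, 40],
    "Cluster 2": [11, 20, 30, 40],
    "Cluster 3": [10, 20, 30, 41],  # hypothetical last value; any value other than 40 works
}


def overlap(a, b):
    """Number of test inputs on which two clusters produce the same output."""
    return sum(x == y for x, y in zip(a, b))


def overlap_scores(outputs):
    """Sum each cluster's overlap with all other clusters (self excluded)."""
    return {
        name: sum(overlap(out, other) for other_name, other in outputs.items() if other_name != name)
        for name, out in outputs.items()
    }


scores = overlap_scores(cluster_outputs)
best = max(scores, key=scores.get)
print(scores, best)
```

Under these assumptions the scores are 6, 5, and 5, matching the caption, and Cluster 1 is selected.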
