Language Models are Better Bug Detector Through Code-Pair Classification

Kamel Alrashedy, Ahmed Binjahlan

TL;DR

The paper addresses bug detection with limited labeled data by proposing code-pair classification: given a buggy function and its fixed version, an LLM must identify which of the two is buggy. Using PyPIBugs as a real-world benchmark, the authors compare fine-tuning (CodeBERT, CodeT5) with in-context learning (GPT-3.5, CodeLlama) on both binary and pairwise tasks, aided by FAISS-based demonstration retrieval. They find that in-context code-pair classification significantly outperforms the other setups, reaching accuracies of around 70–73% (F1 ≈ 82–84%) with GPT-3.5 and CodeLlama, while binary and single-snippet tasks underperform, especially on longer functions. The approach reduces reliance on costly fine-tuning and suggests practical potential for bug detection workflows in real-world software engineering.
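
The FAISS-based demonstration retrieval mentioned above can be pictured as nearest-neighbor search over embedded examples. The sketch below is only an illustration under stated assumptions: it swaps FAISS and a learned code encoder for token-count vectors and plain-Python cosine similarity, and the names `embed`, `cosine`, `retrieve_demonstrations`, `POOL`, and `QUERY` are hypothetical, not taken from the paper.

```python
import math
import re
from collections import Counter


def embed(code: str) -> Counter:
    """Toy embedding: token counts. Assumption: a real setup would use a
    learned code encoder and keep the vectors in a FAISS index."""
    return Counter(re.findall(r"[A-Za-z_]\w*|\S", code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_demonstrations(query: str, pool: list, k: int = 2) -> list:
    """Return the k labeled pairs whose buggy code is most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, embed(ex["buggy"])), reverse=True)
    return ranked[:k]


# Hypothetical pool of labeled (buggy, fixed) pairs.
POOL = [
    {"buggy": "def area(w, h):\n    return w + h",
     "fixed": "def area(w, h):\n    return w * h"},
    {"buggy": "def last(xs):\n    return xs[len(xs)]",
     "fixed": "def last(xs):\n    return xs[-1]"},
    {"buggy": "def is_even(n):\n    return n % 2 == 1",
     "fixed": "def is_even(n):\n    return n % 2 == 0"},
]

QUERY = "def first(xs):\n    return xs[1]"
DEMOS = retrieve_demonstrations(QUERY, POOL, k=1)  # picks the list-indexing pair
```

The retrieved pairs would then be placed in the prompt as few-shot demonstrations ahead of the function pair being classified.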

Abstract

Large language models (LLMs) such as GPT-3.5 and CodeLlama are powerful models for code generation and understanding. Fine-tuning these models comes with a high computational cost and requires a large labeled dataset. Alternatively, in-context learning techniques allow models to learn downstream tasks with only a few examples. Recently, researchers have shown that in-context learning performs well in bug detection and repair. In this paper, we propose a code-pair classification task in which both the buggy and non-buggy versions are given to the model, and the model identifies the buggy one. We evaluate our task on a real-world bug detection dataset with two of the most powerful LLMs. Our experiments indicate that an LLM can often tell the buggy version of the code from the non-buggy one, and that the code-pair classification task is much easier than being given a single snippet and deciding whether and where a bug exists.

Paper Structure (10 sections, 1 figure, 1 table)

This paper contains 10 sections, 1 figure, and 1 table.

Figures (1)

  • Figure 1: Code-pair classification is an in-context learning approach in which the model receives a pair of functions and identifies the buggy one.
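
To make the task in Figure 1 concrete, here is a minimal sketch of how a code-pair prompt might be assembled and how replies might be scored. The prompt wording, the A/B labels, and the names `build_pair_prompt`, `parse_choice`, and `pair_accuracy` are assumptions for illustration; the paper's actual prompt template is not given in this summary, and `fake_model` is a stand-in for a call to GPT-3.5 or CodeLlama.

```python
import random


def build_pair_prompt(func_a, func_b, demos=()):
    """Assemble a few-shot prompt. Each demo is (func_a, func_b, answer)."""
    parts = ["Exactly one of the two functions below contains a bug. "
             "Answer with the letter of the buggy function."]
    for demo_a, demo_b, answer in demos:
        parts.append(f"Function A:\n{demo_a}\n\nFunction B:\n{demo_b}\n\nBuggy: {answer}")
    parts.append(f"Function A:\n{func_a}\n\nFunction B:\n{func_b}\n\nBuggy:")
    return "\n\n".join(parts)


def parse_choice(reply):
    """Read 'A' or 'B' from the start of a reply; None if neither is there."""
    first = reply.strip()[:1].upper()
    return first if first in ("A", "B") else None


def pair_accuracy(pairs, model, seed=0):
    """Fraction of (buggy, fixed) pairs where the model names the buggy one.
    The order is shuffled per pair so position does not leak the label."""
    rng = random.Random(seed)
    correct = 0
    for buggy, fixed in pairs:
        buggy_first = rng.random() < 0.5
        func_a, func_b = (buggy, fixed) if buggy_first else (fixed, buggy)
        gold = "A" if buggy_first else "B"
        correct += parse_choice(model(build_pair_prompt(func_a, func_b))) == gold
    return correct / len(pairs)


# Hypothetical data: one pair whose bug is '+' used in place of '*'.
PAIRS = [("def area(w, h):\n    return w + h", "def area(w, h):\n    return w * h")]


def fake_model(prompt):
    """Stand-in for an LLM call: flags whichever query function uses '+'."""
    query = prompt.rsplit("Function A:", 1)[1]
    func_a, _func_b = query.split("Function B:")
    return "A" if "+" in func_a else "B"


ACCURACY = pair_accuracy(PAIRS, fake_model)
```

Randomizing which slot holds the buggy function is a design choice of this sketch: without it, a model could score well by always answering with the same letter.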