Supporting Cross-language Cross-project Bug Localization Using Pre-trained Language Models

Mahinthan Chandramohan, Dai Quoc Nguyen, Padmanabhan Krishnan, Jovan Jancic

TL;DR

This work tackles the challenge of locating bugs across different languages and projects by leveraging a pre-trained language model trained with supervised contrastive learning to produce robust representations of bug reports and code. A novel ranking strategy combines code-segment analysis and commit-message signals to identify relevant files, code segments, and commits, while a CPU-friendly knowledge distillation pipeline yields a compact model with minimal performance loss. The approach demonstrates strong cross-language and cross-project generalization, achieving high top-k retrieval accuracy and substantial speedups on CPU, making practical deployment feasible. Together, these contributions advance cross-domain bug localization and broaden the accessibility of PLM-based debugging tools in real-world software development.

Abstract

Automatically locating a bug within a large codebase remains a significant challenge for developers. Existing techniques often struggle with generalizability and deployment due to their reliance on application-specific data and large model sizes. This paper proposes a novel technique for bug localization, based on a pre-trained language model (PLM), that transcends project and language boundaries. Our approach leverages contrastive learning to enhance the representation of bug reports and source code. It then utilizes a novel ranking approach that combines commit messages and code segments. Additionally, we introduce a knowledge distillation technique that reduces model size for practical deployment without compromising performance. The approach offers several key benefits. By incorporating code segment and commit message analysis alongside traditional file-level examination, our technique achieves better bug localization accuracy. Furthermore, our model excels at generalizability: trained on code from various projects and languages, it can effectively identify bugs in unseen codebases. To address computational limitations, we propose a CPU-compatible solution. In essence, the proposed work presents a highly effective, generalizable, and efficient bug localization technique with the potential for real-world deployment.

Paper Structure

This paper contains 20 sections, 4 equations, 4 figures, 6 tables, 7 algorithms.

Figures (4)

  • Figure 1: High-level bug-localization framework
  • Figure 2: Proposed ranking technique
  • Figure 3: High-level fine-tuning architecture
  • Figure 4: Knowledge distillation: High-level architecture