Multi-View Adaptive Contrastive Learning for Information Retrieval Based Fault Localization

Chunying Zhou, Xiaoyuan Xie, Gong Chen, Peng He, Bing Li

TL;DR

The paper tackles the lexical and informational gaps in information retrieval-based fault localization by introducing MACL-IRFL, a multi-view graph learning framework that jointly models report-code interactions, report-report similarities, and code-code co-citations. It integrates three graphs with a GNN-based embedding layer and a joint learning objective combining supervised fault localization with cross-view contrastive losses to suppress noise and highlight shared signals. Through extensive experiments on five open-source Java projects, the approach achieves superior MAP, MRR, and Accuracy@N scores compared to seven baselines, supported by statistical significance tests. The work demonstrates the practical value of multi-view graph representations and adaptive contrastive learning for robust fault localization, with potential extensions to finer-grained localization and other programming languages.

Abstract

Most studies on fault localization have focused on information retrieval-based techniques, which build representations for bug reports and source code files and match their semantic vectors through similarity measurement. However, such approaches often ignore useful information that might improve localization performance, such as 1) the interaction relationship between bug reports and source code files; 2) the similarity relationship between bug reports; and 3) the co-citation relationship between source code files. In this paper, we propose a novel approach named Multi-View Adaptive Contrastive Learning for Information Retrieval Fault Localization (MACL-IRFL) to learn these relationships for software fault localization. Specifically, we first generate data augmentations from the report-code interaction view, the report-report similarity view, and the code-code co-citation view separately, and adopt a graph neural network to aggregate the information of bug reports or source code files from the three views in the embedding process. Moreover, we perform contrastive learning across these views. Our contrastive learning task forces the bug report representations to encode information shared by the report-report and report-code views, and the source code file representations to encode information shared by the code-code and report-code views, thereby alleviating the noise from auxiliary information. Finally, to evaluate the performance of our approach, we conduct extensive experiments on five open-source Java projects. The results show that our model improves over the best baseline by up to 28.93%, 25.57% and 20.35% on Accuracy@1, MAP and MRR, respectively.
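
To make the cross-view contrastive idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes an InfoNCE-style loss, a common choice for cross-view contrastive learning that this summary does not confirm for MACL-IRFL. The function name `info_nce`, the `temperature` value, and the random stand-in matrices are all hypothetical. Row i of each matrix is treated as the same bug report seen from two views (for example, report-report and report-code), so the diagonal holds the positive pairs and every other row acts as a negative.

```python
import numpy as np


def info_nce(view_a: np.ndarray, view_b: np.ndarray, temperature: float = 0.2) -> float:
    """InfoNCE-style loss between two views of the same n nodes (assumed form)."""
    # L2-normalise rows so that dot products become cosine similarities.
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # (n, n) cross-view similarity matrix
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal; average their negative log-probability.
    return float(-np.mean(np.diag(log_prob)))


# Hypothetical stand-ins for the two bug report representation matrices.
rng = np.random.default_rng(0)
h_rr = rng.normal(size=(8, 16))                          # report-report view
h_rc_aligned = h_rr + 0.01 * rng.normal(size=(8, 16))    # view that agrees with h_rr
h_rc_random = rng.normal(size=(8, 16))                   # unrelated view

loss_aligned = info_nce(h_rr, h_rc_aligned)
loss_random = info_nce(h_rr, h_rc_random)
print(loss_aligned, loss_random)
```

Under these assumptions the loss is lower when the two views agree on each report than when they are unrelated, which is the sense in which such an objective pushes representations toward information shared across views.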

Paper Structure (26 sections, 9 equations, 10 figures, 6 tables)

This paper contains 26 sections, 9 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Illustration of the multiple relationships between bug reports and source code files, where $r_i$ denotes a bug report and $c_i$ denotes a code file.
  • Figure 2: Bug Reports $\#56029$ and $\#33356$, and their relevant source code files.
  • Figure 3: Bug Reports and their relevant source code files. The node number in the co-citation graph is the index of each source code file.
  • Figure 4: The main framework, where R-C denotes the report-code interaction view, R-R denotes the report-report similarity view, and C-C denotes the code-code co-citation view. $\mathbf{H}_{R-R}^R$ and $\mathbf{H}_{R-C}^R$ represent the bug report representation matrices updated by the report-report view and the report-code view after the GNN embedding layer, while $\mathbf{H}_{C-C}^C$ and $\mathbf{H}_{R-C}^C$ represent the source code representation matrices updated by the code-code view and the report-code view after the GNN embedding layer.
  • Figure 5: The aggregation process of the GNN embedding layer, taking the report-code graph with two-layer aggregation as an example. $r_i$ and $c_i$ represent the initial nodes, while $r_i^{'}$ and $c_i^{'}$ represent the updated nodes.
  • ...and 5 more figures
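
The two-layer aggregation on the report-code graph mentioned in the Figure 5 caption can be illustrated with a small sketch. This is not the paper's GNN: it assumes plain mean aggregation over neighbours with no learned weights, and the function name `propagate`, the toy interaction matrix, and the initial embeddings are all hypothetical.

```python
import numpy as np


def propagate(interactions, report_emb, code_emb, num_layers=2):
    """Mean-aggregate neighbour embeddings over a bipartite report-code graph."""
    # Degree-normalise so each node averages over its neighbours;
    # clamping the degree to 1 avoids division by zero for isolated nodes.
    r_deg = np.maximum(interactions.sum(axis=1, keepdims=True), 1)
    c_deg = np.maximum(interactions.T.sum(axis=1, keepdims=True), 1)
    report_to_code = interactions / r_deg      # (n_reports, n_files)
    code_to_report = interactions.T / c_deg    # (n_files, n_reports)
    r, c = report_emb, code_emb
    for _ in range(num_layers):
        # Reports pull from linked files and files pull from linked reports,
        # both using the previous layer's values.
        r, c = report_to_code @ c, code_to_report @ r
    return r, c


# Toy graph: report 0 touches files 0 and 1; report 1 touches files 1 and 2.
interactions = np.array([[1, 1, 0],
                         [0, 1, 1]], dtype=float)
report_emb = np.eye(2)          # one-hot initial report embeddings
code_emb = np.zeros((3, 2))     # files start with no signal

r2, c2 = propagate(interactions, report_emb, code_emb, num_layers=2)
print(r2)
```

In this toy setup, the updated embedding of report 0 after two layers contains a contribution from report 1, because the two reports share file 1. A single layer would not carry that two-hop signal.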