Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks

Jorge Martinez-Gil

TL;DR

This paper tackles the challenge of measuring code similarity with deep semantic understanding while preserving interpretability. It proposes a GraphCodeBERT-based framework that tokenizes code, computes token embeddings, and uses attention to reveal semantic relationships between code fragments, complemented by similarity matrices and saliency maps. The approach includes dimensionality-reduction visualizations (PCA, t-SNE, UMAP) and a concrete use case comparing classical sorting algorithms, demonstrating how explanations can be generated for clone detection, refactoring, and plagiarism analysis. By exposing token-level contributions and semantic alignments, the method aims to improve developer trust and facilitate actionable insights in software maintenance and code analysis, with potential for broader tool integration and language-agnostic extensions.
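The token-level similarity matrix mentioned above can be illustrated with a minimal sketch. It assumes the token embeddings have already been computed (for example by GraphCodeBERT) and are available as plain lists of floats. The toy vectors, the choice of cosine similarity, the mean pooling step, and the helper names (`similarity_matrix`, `best_alignments`, `mean_pool`) are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(emb_a, emb_b):
    """Entry [i][j] compares token i of fragment A with token j of fragment B."""
    return [[cosine(u, v) for v in emb_b] for u in emb_a]

def best_alignments(tokens_a, tokens_b, matrix):
    """For each token of A, report the most similar token of B and its score."""
    pairs = []
    for i, row in enumerate(matrix):
        j = max(range(len(row)), key=row.__getitem__)
        pairs.append((tokens_a[i], tokens_b[j], row[j]))
    return pairs

def mean_pool(embeddings):
    """Average the token vectors into one fragment-level vector (an assumed pooling choice)."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]

# Hypothetical 3-dimensional embeddings standing in for real model outputs.
tokens_a = ["for", "i", "swap"]
emb_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tokens_b = ["while", "j", "swap"]
emb_b = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]

matrix = similarity_matrix(emb_a, emb_b)
alignments = best_alignments(tokens_a, tokens_b, matrix)
fragment_score = cosine(mean_pool(emb_a), mean_pool(emb_b))

for token_a, token_b, score in alignments:
    print(f"{token_a:>5} ~ {token_b:<5} {score:.3f}")
print(f"fragment-level similarity: {fragment_score:.3f}")
```

Reading the matrix row by row is what makes the score explainable: each token of one fragment points to its closest counterpart in the other, so a high fragment-level score can be traced back to specific token pairs.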

Abstract

Assessing the degree of similarity between code fragments is crucial for ensuring software quality, but it remains challenging because it requires capturing the deeper semantic aspects of code. Traditional syntactic methods often fail to identify these semantic connections. Recent advancements have addressed this challenge, though they frequently sacrifice interpretability. To address this, we present an approach that aims to improve the transparency of similarity assessment by using GraphCodeBERT, which enables the identification of semantic relationships between code fragments. The approach identifies similar code fragments and clarifies the reasons behind that identification, helping developers better understand and trust the results. The source code for our implementation is available at https://www.github.com/jorge-martinez-gil/graphcodebert-interpretability.

Paper Structure (23 sections, 14 equations, 8 figures)

Figures (8)

  • Figure 1: Visualization of the similarity relationships among various classical sorting algorithms using GraphCodeBERT. The heatmap shows how similar the sorting algorithms are to one another, based on their structure and behavior. Algorithms such as Bubble Sort and Insertion Sort, which follow similar step-by-step processes, show higher similarity scores. By contrast, Quick Sort, which uses a more complex partitioning approach, shows lower similarity with Bubble Sort.
  • Figure 2: Pairwise comparisons of classical sorting algorithms using PCA, showing the token embeddings in a 2D space (Part 1).
  • Figure 3: Pairwise comparisons of classical sorting algorithms using PCA, showing the token embeddings in a 2D space (Part 2).
  • Figure 4: Pairwise comparisons of classical sorting algorithms using t-SNE, showing the token embeddings in a 2D space (Part 1).
  • Figure 5: Pairwise comparisons of classical sorting algorithms using t-SNE, showing the token embeddings in a 2D space (Part 2).
  • ...and 3 more figures