Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks
Jorge Martinez-Gil
TL;DR
This paper tackles the challenge of measuring code similarity with deep semantic understanding while preserving interpretability. It proposes a GraphCodeBERT-based framework that tokenizes code, computes token embeddings, and uses attention to reveal semantic relationships between code fragments, complemented by similarity matrices and saliency maps. The approach includes dimensionality-reduction visualizations (PCA, t-SNE, UMAP) and a concrete use case comparing classical sorting algorithms, demonstrating how explanations can be generated for clone detection, refactoring, and plagiarism analysis. By exposing token-level contributions and semantic alignments, the method aims to improve developer trust and facilitate actionable insights in software maintenance and code analysis, with potential for broader tool integration and language-agnostic extensions.
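The token-level similarity matrix mentioned above can be illustrated with a minimal sketch. This is not the author's implementation: the token lists and three-dimensional embeddings below are invented toy values, and the helper names (cosine, similarity_matrix, best_alignments) are hypothetical. In the setting the paper describes, the embeddings would instead be produced by GraphCodeBERT for each code fragment.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)

def similarity_matrix(emb_a, emb_b):
    """Rows index tokens of fragment A, columns index tokens of fragment B."""
    return [[cosine(u, v) for v in emb_b] for u in emb_a]

def best_alignments(tokens_a, tokens_b, matrix):
    """For each token in A, report the most similar token in B and its score."""
    pairs = []
    for i, row in enumerate(matrix):
        j = max(range(len(row)), key=row.__getitem__)
        pairs.append((tokens_a[i], tokens_b[j], row[j]))
    return pairs

# Toy inputs (assumed for illustration only, not real model output).
tokens_a = ["for", "i", "swap"]
tokens_b = ["while", "j", "swap"]
emb_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
emb_b = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]

matrix = similarity_matrix(emb_a, emb_b)
alignments = best_alignments(tokens_a, tokens_b, matrix)
for tok_a, tok_b, score in alignments:
    print(f"{tok_a} -> {tok_b} ({score:.3f})")
```

Reading the matrix row by row gives one simple kind of explanation: which token in the second fragment each token in the first fragment aligns with most strongly, and how strongly.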
Abstract
Assessing the degree of similarity between code fragments is crucial for ensuring software quality, but it remains challenging because it requires capturing the deeper semantic aspects of code. Traditional syntactic methods often fail to identify these semantic connections. Recent advances have addressed this challenge, though they frequently sacrifice interpretability. To address this limitation, we present an approach that improves the transparency of similarity assessment by using GraphCodeBERT, which enables the identification of semantic relationships between code fragments. The approach not only identifies similar code fragments but also clarifies the reasons behind that identification, helping developers better understand and trust the results. The source code for our implementation is available at https://www.github.com/jorge-martinez-gil/graphcodebert-interpretability.
