On the Use of Deep Learning Models for Semantic Clone Detection

Subroto Nag Pinku, Debajyoti Mondal, Chanchal K. Roy

TL;DR

The paper addresses semantic and cross-language code clone detection. It highlights the limitations of standard benchmarks such as BigCloneBench and draws on the recently released GPTCloneBench for broader evaluation. It conducts a multi-step assessment of five state-of-the-art models (GMN, ASTNN, CodeBERT, CLCDSA, C4) across multiple datasets and mutation-based test suites to probe robustness and generalization. The single-language models perform well on BigCloneBench but variably on semantic benchmarks, while the cross-language model C4 often yields the best semantic-clone performance and remains robust under mutation. The work identifies dataset properties and representation choices as major factors in performance and recommends multi-dataset, mutation-aware evaluation. It points to C4 as a strong, language-agnostic option for semantic clone detection, with practical implications for software maintenance and tooling.

Abstract

Detecting and tracking code clones can ease various software development and maintenance tasks when changes in a code fragment should be propagated to all of its copies. Several deep learning-based clone detection models have appeared in the literature for detecting syntactic and semantic clones, and they are widely evaluated on the BigCloneBench dataset. However, class imbalance and the small number of semantic clones make BigCloneBench less than ideal for interpreting model performance. Researchers also use other datasets such as GoogleCodeJam, OJClone, and SemanticCloneBench to understand model generalizability. To overcome the limitations of existing datasets, the GPT-assisted semantic and cross-language clone dataset GPTCloneBench has been released. However, how these models compare across datasets remains unclear. In this paper, we propose a multi-step evaluation approach for five state-of-the-art clone detection models that leverages existing benchmark datasets, including GPTCloneBench, and uses mutation operators to study model ability. Specifically, we examine three high-performing single-language models (ASTNN, GMN, CodeBERT) on BigCloneBench, SemanticCloneBench, and GPTCloneBench, and test their robustness with mutation operations. Additionally, we compare them against cross-language models (C4, CLCDSA) known for detecting semantic clones. While single-language models show high F1 scores on BigCloneBench, their performance on SemanticCloneBench varies by up to 20%. Interestingly, the cross-language model C4 outperforms the other models on SemanticCloneBench by around 7% and performs similarly on BigCloneBench and GPTCloneBench. On mutation-based datasets, C4 is more robust (less than 1% difference) than the single-language models, which show high variability.
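
The abstract's concern that class imbalance makes F1 scores hard to interpret can be illustrated with a small sketch. The counts below are hypothetical and are not BigCloneBench statistics; the function name `precision_recall_f1` and the 950/50 split are assumptions chosen only to show the effect.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical test set: 950 clone pairs, 50 non-clone pairs.
n_clone, n_non_clone = 950, 50

# A trivial "detector" that labels every pair as a clone.
# Clone class: all clones are hits, all non-clones are false alarms.
clone_p, clone_r, clone_f1 = precision_recall_f1(
    tp=n_clone, fp=n_non_clone, fn=0
)

# Non-clone class: the detector never predicts it, so every non-clone is missed.
non_p, non_r, non_f1 = precision_recall_f1(tp=0, fp=0, fn=n_non_clone)

print(f"clone class F1:     {clone_f1:.3f}")
print(f"non-clone class F1: {non_f1:.3f}")
```

Under these assumed counts, a detector that has learned nothing still reaches a high F1 on the majority class while scoring zero on the minority class, which is one reason to report results on several datasets and on per-class metrics rather than on a single imbalanced benchmark.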

Paper Structure

This paper contains 40 sections, 3 equations, 3 figures, 15 tables.

Figures (3)

  • Figure 1: Example of semantic code clones. These code fragments implement the same function. The code in (a) and (b) are single-language clones. Code in (c) is a cross-language clone with (a) and (b).
  • Figure 2: Overview of our approach. The left side shows the dataset processing. The right side shows the input representations and clone detection models.
  • Figure 3: Models' performance when trained and tested on benchmark datasets.