Source Code Clone Detection Using Unsupervised Similarity Measures

Jorge Martinez-Gil

TL;DR

This work tackles source code clone detection through unsupervised similarity measures. It formalizes similarity as a function $f: S \times S \rightarrow [0,1]$ that scores how alike two code fragments are, so that a threshold on the score can separate clones from non-clones. It surveys a broad set of unsupervised measures, ranging from textual and structural to semantic approaches, and evaluates them on the IR-Plag dataset to quantify accuracy and efficiency. Key findings show that a small subset of measures (e.g., Jaccard, N-grams, Winnow, RKR-GST, Output Analysis) can balance detection performance with practical runtime, while others excel in precision or recall but may be computationally prohibitive. The study highlights the need for hybrid approaches, cross-language scalability, and transfer learning to broaden applicability in real-world codebases. It offers guidance for engineers on selecting suitable unsupervised methods and identifies avenues for future research.
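As a concrete illustration of a similarity function $f: S \times S \rightarrow [0,1]$ combined with thresholding, the sketch below computes a token-level Jaccard score, one of the measures named above. It is not taken from the paper's repository: the tokenizer, the function names (`tokenize`, `jaccard_similarity`, `is_clone`), and the default threshold of 0.7 are all assumptions chosen for illustration.

```python
import re


def tokenize(code: str) -> set[str]:
    """Split source text into a set of word tokens and single punctuation marks.

    This is a deliberately simple, language-agnostic tokenizer (an assumption);
    a real study would likely use a proper lexer for the target language.
    """
    return set(re.findall(r"\w+|[^\w\s]", code))


def jaccard_similarity(a: str, b: str) -> float:
    """Return |A ∩ B| / |A ∪ B| over the token sets, a value in [0, 1]."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        # Two empty fragments are treated as identical (a convention, not a rule).
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_clone(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag a pair as a clone when the score reaches the threshold.

    The default threshold is hypothetical; in practice it would be tuned
    on a benchmark such as the one used in the evaluation.
    """
    return jaccard_similarity(a, b) >= threshold
```

For example, `jaccard_similarity("int a = 1;", "int b = 1;")` shares four of six distinct tokens, so the pair counts as a clone at a threshold of 0.6 but not at 0.7. This shows how strongly the decision depends on where the threshold is set.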

Abstract

Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection, code search, and code recommendation. This work presents a comparative analysis of unsupervised similarity measures for source code clone detection. The goal is to give an overview of current state-of-the-art techniques, together with their strengths and weaknesses. To that end, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset, to guide software engineers in selecting appropriate methods for their specific use cases. The source code of this study is available at https://github.com/jorge-martinez-gil/codesim

Paper Structure

This paper contains 15 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Accuracy of the unsupervised semantic similarity measures when performing clone detection
  • Figure 2: Execution time of the unsupervised semantic similarity measures when performing clone detection
  • Figure 3: Comparison of the feasibility index of the unsupervised methods