Source Code Clone Detection Using Unsupervised Similarity Measures

Jorge Martinez-Gil

TL;DR

This work tackles source code clone detection with unsupervised similarity measures. It formalizes similarity as a function $f: S \times S \rightarrow [0,1]$ that scores how alike two code fragments are, so that a threshold can be applied to the score. It surveys a broad set of unsupervised measures, ranging from textual and structural to semantic approaches, and evaluates them on the IR-Plag dataset to quantify accuracy and efficiency. Key findings show that a small subset of measures (e.g., Jaccard, N-grams, Winnow, RKR-GST, Output Analysis) can balance detection performance with practical runtime, while others excel in precision or recall but may be computationally prohibitive. The study highlights the need for hybrid approaches, cross-language scalability, and transfer learning to broaden applicability in real-world codebases. It offers guidance for engineers on selecting suitable unsupervised methods and identifies avenues for future research.
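To make the formalization concrete, the following is a minimal sketch of one such measure: a token-level Jaccard similarity that maps two code fragments to a score in [0, 1] and then applies a threshold. The tokenizer, the function names, and the threshold value of 0.7 are illustrative assumptions and are not taken from the paper or its repository.

```python
import re


def tokenize(code: str) -> set:
    """Split source text into a set of identifiers, numbers, and symbols.

    Hypothetical tokenizer chosen for illustration; real tools usually
    rely on a language-aware lexer.
    """
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def jaccard_similarity(a: str, b: str) -> float:
    """Return |A ∩ B| / |A ∪ B| over the token sets, a value in [0, 1]."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty fragments are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_clone(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag a pair as a clone when the score reaches the threshold.

    The default threshold is an assumed value, not one from the study.
    """
    return jaccard_similarity(a, b) >= threshold


if __name__ == "__main__":
    first = "int sum = a + b;"
    second = "int total = a + b;"
    print(jaccard_similarity(first, second))
    print(is_clone(first, second))
```

Because the score only depends on shared tokens, a measure like this is cheap to compute but blind to statement order and structure, which is one reason a comparison across textual, structural, and semantic measures is useful.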

Abstract

Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection and code search and recommendation. This work presents a comparative analysis of unsupervised similarity measures for source code clone detection. The goal is to give an overview of current state-of-the-art techniques and their strengths and weaknesses. To that end, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset to guide software engineers in selecting appropriate methods for their specific use cases. The source code of this study is available at https://github.com/jorge-martinez-gil/codesim

Paper Structure (15 sections, 3 figures, 2 tables)

Introduction
Background
Problem definition
Similarity categories
The importance of unsupervised measures
Future perspectives
Methods
Unsupervised methods
Examples
Evaluation
Dataset
Results
Other metrics
Discussion
Conclusion

Figures (3)

Figure 1: Accuracy of the unsupervised semantic similarity measures when performing clone detection
Figure 2: Execution time of the unsupervised semantic similarity measures when performing clone detection
Figure 3: Comparison of the feasibility index of the unsupervised methods
