Source Code Clone Detection Using Unsupervised Similarity Measures
Jorge Martinez-Gil
TL;DR
This work tackles source code clone detection with unsupervised similarity measures. It formalizes similarity as a function $f: S \times S \rightarrow [0,1]$ that scores how alike two code fragments are, so that a threshold can separate clones from non-clones. It surveys a broad set of unsupervised measures, ranging from textual and structural to semantic approaches, and evaluates them on the IR-Plag dataset to quantify accuracy and efficiency. The key finding is that a small subset of measures (e.g., Jaccard, N-grams, Winnow, RKR-GST, Output Analysis) balances detection performance with practical runtime, while others excel in precision or recall but may be computationally prohibitive. The study highlights the need for hybrid approaches, cross-language scalability, and transfer learning to broaden applicability in real-world codebases. It offers guidance for engineers on selecting suitable unsupervised methods and identifies avenues for future research.
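To make the formalization $f: S \times S \rightarrow [0,1]$ concrete, the following is a minimal sketch of one such measure, Jaccard similarity over token sets, combined with a decision threshold. It is an illustration only and is not taken from the study's code: the regex tokenizer, the function names, and the threshold value of 0.7 are all assumptions chosen for this example.

```python
import re


def tokenize(code: str) -> set:
    """Split source text into a set of identifier, number, and symbol tokens.

    Assumption: a simple regex tokenizer, not a language-aware lexer.
    """
    return set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code))


def jaccard_similarity(a: str, b: str) -> float:
    """Return |A ∩ B| / |A ∪ B| over the two token sets, a value in [0, 1]."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    if not tokens_a and not tokens_b:
        return 1.0  # convention assumed here: two empty fragments are identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_clone(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag a pair as a clone when its score reaches the threshold.

    The default of 0.7 is a hypothetical value, not one reported by the study.
    """
    return jaccard_similarity(a, b) >= threshold


if __name__ == "__main__":
    original = "int sum = a + b;"
    renamed = "int total = a + b;"
    print(jaccard_similarity(original, renamed))
    print(is_clone(original, renamed))
```

In this sketch, renaming a single variable leaves most tokens shared, so the score stays high. In practice the threshold would be tuned on labeled pairs from a benchmark rather than fixed in advance.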
Abstract
Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection, code search, and code recommendation. This work presents a comparative analysis of unsupervised similarity measures for source code clone detection. The goal is to give an overview of current state-of-the-art techniques, together with their strengths and weaknesses. To that end, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset, in order to guide software engineers in selecting appropriate methods for their specific use cases. The source code of this study is available at https://github.com/jorge-martinez-gil/codesim
