Detecting Essence Code Clones via Information Theoretic Analysis
Lida Zhao, Shihan Dou, Yutao Hu, Yueming Wu, Jiahui Wu, Chengwei Liu, Lyuye Zhang, Yi Liu, Jun Sun, Xuanjing Huang, Yang Liu
TL;DR
This work defines essence clones as code blocks that share the same core logic despite differing peripheral code, and introduces ECScan, an information-theoretic detector that weights each line by its information content. Using TF-IDF-inspired line weights and a four-stage pipeline (Lexical Analysis, Weight Assignment, Locate & Filter, Verify), ECScan emphasizes core logic so that essence clones are detected regardless of peripheral differences. Evaluations on BigCloneBench and four real-world Java projects show that ECScan achieves high precision and robust recall for both essence clones and general Type-3 clones, with an average F1 of around 0.85 on essence clone detection and performance competitive with leading detectors on broader clone types. The method scales well and is of practical use to developers who want to improve code quality and maintainability by focusing on semantically meaningful code segments. The authors also acknowledge limitations on extremely large datasets and in edge cases where a block contains little core logic.
Abstract
Code cloning, a widespread practice in software development, involves replicating code fragments to save time, often at the expense of software maintainability and quality. In this paper, we address the specific challenge of detecting "essence clones", a complex subtype of Type-3 clones that share critical logic despite differing peripheral code. Traditional techniques often fail to detect essence clones because they focus on syntax. To overcome this limitation, we introduce ECScan, a novel detection tool that leverages information theory to assess the semantic importance of code lines. By assigning a weight to each line based on its information content, ECScan emphasizes core logic over peripheral code differences. Our comprehensive evaluation across various real-world projects shows that ECScan significantly outperforms existing tools in detecting essence clones, achieving an average F1-score of 85%. It also demonstrates robust performance across all clone types and offers exceptional scalability. This study advances clone detection by providing a practical tool for developers to enhance code quality and reduce maintenance burdens, emphasizing the semantic aspects of code through an innovative information-theoretic approach.
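To make the idea of information-weighted lines concrete, the following is a minimal sketch and not ECScan's actual implementation. It assumes an IDF-style weight, log2(N / df), where N is the number of code blocks and df is the number of blocks containing a given line, and it compares two blocks by the weight of their shared lines. The names normalize, line_weights, and weighted_similarity, the whitespace-only normalization, and the choice of denominator are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def normalize(line):
    # Hypothetical normalization: collapse runs of whitespace only.
    return " ".join(line.split())


def line_weights(blocks):
    """Return an IDF-style weight log2(N / df) for every distinct line.

    A line that occurs in every block carries no information (weight 0);
    a line that occurs in few blocks is weighted heavily.
    """
    n = len(blocks)
    df = Counter()
    for block in blocks:
        # Count each line at most once per block (document frequency).
        df.update({normalize(l) for l in block if l.strip()})
    return {line: math.log2(n / count) for line, count in df.items()}


def weighted_similarity(a, b, weights):
    """Weight of shared lines divided by the total weight of the lighter block."""
    la = {normalize(l) for l in a if l.strip()}
    lb = {normalize(l) for l in b if l.strip()}
    shared = sum(weights.get(l, 0.0) for l in la & lb)
    total_a = sum(weights.get(l, 0.0) for l in la)
    total_b = sum(weights.get(l, 0.0) for l in lb)
    denom = min(total_a, total_b)
    # Blocks made only of zero-weight lines have no core logic to compare.
    return shared / denom if denom > 0 else 0.0


if __name__ == "__main__":
    core = "crc = table[(crc ^ b) & 0xff] ^ (crc >>> 8);"
    blocks = [
        ["int i = 0;", core, "return crc;"],
        ['log.info("start");', "int i = 0;", core, 'log.info("done");'],
        ["int i = 0;", "return total;"],
        ["int i = 0;", 'log.info("start");'],
    ]
    w = line_weights(blocks)
    print(weighted_similarity(blocks[0], blocks[1], w))  # share the rare core line
    print(weighted_similarity(blocks[0], blocks[2], w))  # share only boilerplate
```

Under this weighting, a line that appears in every block contributes nothing to the score, so two blocks that share only ubiquitous boilerplate are not reported as similar, while blocks that share a rare line are. A real detector would also need the tokenization, candidate filtering, and verification stages named in the pipeline, which this sketch omits.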
