Code Clone Detection via an AlphaFold-Inspired Framework
Changguo Jia, Yi Zhan, Tianqi Zhao, Hengzhi Ye, Minghui Zhou
TL;DR
Code clone detection faces a semantic gap when relying on single token sequences. AlphaCC adapts AlphaFold’s sequence-to-structure paradigm to code by constructing a Code MSA from retrieved lexically similar code fragments and encoding it with Codeformer, a dual-attention module that processes both within-sequence and cross-sequence relationships. It then uses a late-interaction similarity followed by a margin loss to classify clone pairs, achieving state-of-the-art F1 on GCJ, BigCloneBench, and OJClone while remaining tool-independent and efficient. This approach enables robust semantic clone detection without third-party analyzers, offering practical scalability across large-scale code bases and multiple languages.
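As a rough illustration of the scoring step, the sketch below pairs a late-interaction score with a margin loss. The summary above does not give the exact formulation, so both choices are assumptions: the score is a MaxSim-style average (each token embedding in one fragment is matched to its most similar token embedding in the other), and the loss is a simple hinge on that score. The names `late_interaction` and `margin_loss`, and the margin of 0.5, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def late_interaction(tokens_a, tokens_b):
    """Assumed MaxSim-style score: for each token embedding in A, take its
    best cosine match in B, then average over the tokens of A."""
    return sum(max(cosine(a, b) for b in tokens_b) for a in tokens_a) / len(tokens_a)


def margin_loss(score, is_clone, margin=0.5):
    """Assumed hinge-style margin loss: clone pairs are pushed above +margin,
    non-clone pairs below -margin."""
    y = 1.0 if is_clone else -1.0
    return max(0.0, margin - y * score)


# Toy per-token embeddings for two fragments (illustrative values only).
frag_a = [[1.0, 0.0], [0.0, 1.0]]
frag_b = [[0.0, 2.0], [3.0, 0.0]]
score = late_interaction(frag_a, frag_b)
print(score, margin_loss(score, is_clone=True), margin_loss(score, is_clone=False))
```

Because the interaction happens only after each fragment is encoded, per-fragment embeddings can be computed once and reused across many candidate pairs, which is the usual motivation for late interaction.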
Abstract
Code clone detection plays a critical role in software maintenance and vulnerability analysis. Many methods have been proposed to detect code clones, but they struggle to extract high-level program semantics directly from a single linear token sequence, which leads to unsatisfactory detection performance. AlphaFold has successfully addressed a similar single-sequence challenge in protein structure prediction. Motivated by this success and by the sequential similarities between proteins and code, we adapt AlphaFold's approach to code clone detection. In particular, we propose AlphaCC, which represents code fragments as token sequences and adapts AlphaFold's sequence-to-structure modeling capability to infer code semantics. The AlphaCC pipeline has three steps. First, AlphaCC transforms each input code fragment into a token sequence and, motivated by AlphaFold's use of multiple sequence alignment (MSA), uses a novel retrieval-augmentation strategy to construct an MSA from lexically similar token sequences. Second, AlphaCC adopts a modified attention-based encoder based on AlphaFold to model dependencies within and across token sequences. Finally, unlike AlphaFold, whose task is protein structure prediction, AlphaCC computes similarity scores between token sequences through a late interaction strategy and performs binary classification to determine code clone pairs. Comprehensive evaluations on three datasets, two of which target semantic clone detection, show that AlphaCC consistently outperforms all baselines, demonstrating strong semantic understanding. AlphaCC also performs well on instances where tool-dependent methods fail, highlighting its tool independence. Moreover, AlphaCC maintains competitive efficiency, enabling practical use in large-scale clone detection tasks.
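The first pipeline step (building an MSA from lexically similar token sequences) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the regex tokenizer, the Jaccard overlap used as the lexical similarity measure, the `<pad>` symbol, and the names `tokenize`, `jaccard`, and `build_code_msa` are all hypothetical, and the abstract does not specify the retriever or how sequences are aligned.

```python
import re

PAD = "<pad>"  # hypothetical padding symbol


def tokenize(code):
    """Very rough tokenizer: identifiers, integer literals, or single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def jaccard(tokens_a, tokens_b):
    """Lexical similarity as Jaccard overlap of token sets (an assumption)."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def build_code_msa(query, corpus, k=2, max_len=16):
    """Return the query's token sequence followed by its k most lexically
    similar corpus sequences, each padded or truncated to max_len."""
    query_tokens = tokenize(query)
    ranked = sorted(
        (tokenize(fragment) for fragment in corpus),
        key=lambda tokens: jaccard(query_tokens, tokens),
        reverse=True,
    )
    rows = [query_tokens] + ranked[:k]
    return [(row + [PAD] * max_len)[:max_len] for row in rows]


corpus = [
    "int add(int a, int b) { return a + b; }",
    "print('hello')",
    "int sum(int x, int y) { return x + y; }",
]
msa = build_code_msa("int add(int p, int q) { return p + q; }", corpus, k=2)
for row in msa:
    print(" ".join(row))
```

Here the MSA is simply a stack of equal-length token rows with the query first; an encoder that attends both along each row and down each column could then model dependencies within and across sequences, as the second step describes.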
