TriFusion-LLM: Prior-Guided Multimodal Fusion with LLM Arbitration for Fine-grained Code Clone Detection

Mengdi Li; Yuming Liu; He Wang; Zifeng Xu; Yuqing Zhang

Abstract

Code clone detection (CCD) supports software maintenance, refactoring, and security analysis. Although pre-trained models capture code semantics, most work reduces CCD to binary classification, overlooking the heterogeneity of clone types and the seven fine-grained categories in BigCloneBench. We present Full Model, a multimodal fusion framework that jointly integrates heuristic similarity priors from classical machine learning, structural signals from abstract syntax trees (ASTs), and deep semantic embeddings from CodeBERT into a single predictor. By fusing structural, statistical, and semantic representations, Full Model improves discrimination among fine-grained clone types while keeping inference cost practical. On the seven-class BigCloneBench benchmark, Full Model raises Macro-F1 from 0.695 to 0.875. Ablation studies show that using the primary model's probability distribution as a prior to guide selective arbitration by a large language model (LLM) substantially outperforms blind reclassification; arbitrating only ~0.2% of high-uncertainty samples yields an additional 0.3 absolute Macro-F1 gain. Overall, Full Model achieves an effective performance-cost trade-off for fine-grained CCD and offers a practical solution for large-scale industrial deployment.
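The prior-guided selective arbitration described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that uncertainty is measured by the top-class probability falling below a threshold, and that the arbiter may only choose among the top-k classes of the primary model's prior. The names `select_uncertain`, `top_k_candidates`, `arbitrate`, and `stub_arbiter` are hypothetical, and `stub_arbiter` is a placeholder standing in for a real LLM call.

```python
from typing import Callable, List, Sequence

Prior = Sequence[float]  # class-probability vector from the primary model


def select_uncertain(probs: Sequence[Prior], threshold: float) -> List[int]:
    """Indices of samples whose top-class probability is below the threshold."""
    return [i for i, p in enumerate(probs) if max(p) < threshold]


def top_k_candidates(prior: Prior, k: int = 2) -> List[int]:
    """Class indices sorted by descending prior probability, truncated to k."""
    return sorted(range(len(prior)), key=lambda c: prior[c], reverse=True)[:k]


def arbitrate(
    probs: Sequence[Prior],
    threshold: float,
    arbiter: Callable[[int, List[int], Prior], int],
    k: int = 2,
) -> List[int]:
    """Keep the primary prediction unless a sample is uncertain.

    For uncertain samples, the arbiter sees the prior and may only pick
    one of the top-k candidates (assumed design, not taken from the paper).
    """
    preds = [top_k_candidates(p, 1)[0] for p in probs]
    for i in select_uncertain(probs, threshold):
        candidates = top_k_candidates(probs[i], k)
        choice = arbiter(i, candidates, probs[i])
        # Ignore answers outside the candidate set; fall back to the prior.
        if choice in candidates:
            preds[i] = choice
    return preds


def stub_arbiter(index: int, candidates: List[int], prior: Prior) -> int:
    """Placeholder for an LLM call: always returns the last candidate."""
    return candidates[-1]


if __name__ == "__main__":
    toy_probs = [[0.90, 0.05, 0.05], [0.40, 0.35, 0.25]]
    print(arbitrate(toy_probs, threshold=0.5, arbiter=stub_arbiter))
```

Only the second toy sample falls below the threshold, so only it is routed to the arbiter. Restricting the arbiter to the prior's top candidates is one way to keep it from reclassifying blindly; the threshold itself would be tuned on validation data.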

Paper Structure (33 sections, 2 equations, 5 figures, 7 tables)

Introduction
Related Work
Taxonomy and Benchmarking of Code Clones
Conventional and Augmented Code Clone Detection
Code Representation Learning and Semantic Analysis
Hybrid Representations and Multi-modal Learning
LLM-driven Code Clone Detection
Methodology
Task Definition and Rationale for Fine-grained 7-class Classification
Dataset Construction and Preprocessing
Data Cleansing and Metadata Indexing
Project-level Isolation Protocol
Goal-oriented Stratified Sampling
Diversity Optimization for Type-4 Semantic Clones
Heterogeneous Feature Space Extraction
...and 18 more sections

Figures (5)

Figure 1: Overall pipeline of the proposed multi-dimensional clone detection framework.
Figure 2: Data processing pipeline of the proposed multi-dimensional clone detection framework.
Figure 3: Confusion matrix analysis on the test set. (a) Normalized confusion matrix of the proposed model. (b) Difference matrix illustrating performance improvements over the baseline.
Figure 4: Performance behavior across confidence intervals on the validation set. The left figure shows prediction accuracy across bins, while the right figure illustrates the Macro-F1 variation used to determine the arbitration threshold.
Figure 5: Performance behavior across confidence intervals on the validation set. The left figure shows prediction accuracy across bins, while the right figure illustrates the Macro-F1 variation used to determine the arbitration threshold.
