FDLLM: A Dedicated Detector for Black-Box LLMs Fingerprinting
Zhiyuan Fu, Junfan Chen, Lan Zhang, Ting Yang, Jun Niu, Hongyu Sun, Ruidong Li, Peng Liu, Jice Wang, Fannv He, Qiuling Yue, Yuqing Zhang
TL;DR
FDLLM tackles the problem of attributing LLM-generated text to its source model under black-box API access by introducing FD-Dataset and a LoRA-based fingerprinting method. The approach adapts a foundation model with Low-Rank Adaptation (LoRA), training small low-rank adapters while the base weights stay frozen, to learn deep, persistent fingerprints that cluster outputs from the same source model and separate different models in representation space. Empirically, FDLLM achieves a Macro F1 score 22.1% higher than the strongest baseline and maintains an average accuracy of 95% on unseen models, with notable robustness to adversarial edits such as translation, polishing, and synonym substitution. The work provides a practical, scalable solution for accountability and governance in real-world LLM ecosystems, supported by a comprehensive bilingual benchmark and extensive ablations.
Abstract
Large Language Models (LLMs) are rapidly transforming the landscape of digital content creation. However, the prevalent black-box Application Programming Interface (API) access to many LLMs introduces significant challenges in accountability, governance, and security. LLM fingerprinting, which aims to identify the source model by analyzing statistical and stylistic features of generated text, offers a potential solution. Current progress in this area is hindered by a lack of dedicated datasets and the need for efficient, practical methods that are robust against adversarial manipulations. To address these challenges, we introduce FD-Dataset, a comprehensive bilingual fingerprinting benchmark comprising 90,000 text samples from 20 well-known proprietary and open-source LLMs. Furthermore, we present FDLLM, a novel fingerprinting method that leverages parameter-efficient Low-Rank Adaptation (LoRA) to fine-tune a foundation model. This approach enables the adapted model to extract deep, persistent features that characterize each source LLM. Through our analysis, we find that LoRA adaptation promotes the aggregation of outputs from the same LLM in representation space while enhancing the separation between different LLMs. This mechanism explains why LoRA proves particularly effective for LLM fingerprinting. Extensive empirical evaluations on FD-Dataset demonstrate FDLLM's superiority, achieving a Macro F1 score 22.1% higher than the strongest baseline. FDLLM also exhibits strong generalization to newly released models, achieving an average accuracy of 95% on unseen models. Notably, FDLLM remains consistently robust under various adversarial attacks, including polishing, translation, and synonym substitution. Experimental results show that FDLLM reduces the average attack success rate from 49.2% for LM-D to 23.9%.
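The headline metric above is Macro F1 over source-model labels. As a reading aid, the following is a minimal sketch of how Macro F1 is conventionally computed for a multi-class attribution task; it is not the authors' evaluation code. The function name `macro_f1` and the labels in the usage note are hypothetical placeholders.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label in y_true or y_pred.

    Illustrative helper (hypothetical name), not taken from the paper.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # text wrongly attributed to this source
            fn[truth] += 1  # text from this source that was missed

    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)

    # Each source model counts equally, regardless of how many samples it has.
    return sum(scores) / len(scores)
```

For example, `macro_f1(["model_a", "model_a", "model_b", "model_b"], ["model_a", "model_b", "model_b", "model_b"])` averages a per-class F1 of 2/3 for `model_a` and 4/5 for `model_b`. Because every class is weighted equally, a detector cannot score well by being accurate only on the most frequent source models.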
