FDLLM: A Dedicated Detector for Black-Box LLMs Fingerprinting

Zhiyuan Fu, Junfan Chen, Lan Zhang, Ting Yang, Jun Niu, Hongyu Sun, Ruidong Li, Peng Liu, Jice Wang, Fannv He, Qiuling Yue, Yuqing Zhang

TL;DR

FDLLM addresses the problem of attributing LLM-generated text to its source model when only black-box API access is available. It contributes the FD-Dataset and a LoRA-based fingerprinting method. The method adapts a foundation model with Low-Rank Adaptation, training only the low-rank adapters while the base weights stay frozen, to learn deep, persistent fingerprints that cluster outputs from the same source model and separate different models in representation space. Empirically, FDLLM achieves a Macro F1 score 22.1% higher than the strongest baseline and an average accuracy of about 95% on unseen models, and it remains robust to adversarial edits such as translation, polishing, and synonym substitution. The work offers a practical approach to accountability and governance in LLM ecosystems, supported by a bilingual benchmark and ablation studies.

Abstract

Large Language Models (LLMs) are rapidly transforming the landscape of digital content creation. However, the prevalent black-box Application Programming Interface (API) access to many LLMs introduces significant challenges in accountability, governance, and security. LLM fingerprinting, which aims to identify the source model by analyzing statistical and stylistic features of generated text, offers a potential solution. Current progress in this area is hindered by a lack of dedicated datasets and the need for efficient, practical methods that are robust against adversarial manipulations. To address these challenges, we introduce FD-Dataset, a comprehensive bilingual fingerprinting benchmark comprising 90,000 text samples from 20 famous proprietary and open-source LLMs. Furthermore, we present FDLLM, a novel fingerprinting method that leverages parameter-efficient Low-Rank Adaptation (LoRA) to fine-tune a foundation model. This approach enables LoRA to extract deep, persistent features that characterize each source LLM. Through our analysis, we find that LoRA adaptation promotes the aggregation of outputs from the same LLM in representation space while enhancing the separation between different LLMs. This mechanism explains why LoRA proves particularly effective for LLM fingerprinting. Extensive empirical evaluations on FD-Dataset demonstrate FDLLM's superiority, achieving a Macro F1 score 22.1% higher than the strongest baseline. FDLLM also exhibits strong generalization to newly released models, achieving an average accuracy of 95% on unseen models. Notably, FDLLM remains consistently robust under various adversarial attacks, including polishing, translation, and synonym substitution. Experimental results show that FDLLM reduces the average attack success rate from 49.2% (LM-D) to 23.9%.
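
The low-rank update that the abstract relies on can be illustrated with a minimal sketch. It assumes the standard LoRA formulation, in which a frozen weight matrix W is augmented by a trainable product scaled by alpha / r, with B initialized to zero. The layer sizes, rank, and scaling factor below are illustrative placeholders, not values reported for FDLLM, and the helper names (`matmul`, `lora_weight`, `trainable_params`) are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Effective weight of a LoRA-adapted layer: W + (alpha / r) * B @ A."""
    r = len(a)                      # rank = number of rows of A
    delta = matmul(b, a)            # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

def trainable_params(d_out, d_in, r):
    """Number of trainable values in the adapters A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

random.seed(0)
d_out, d_in, r, alpha = 6, 8, 2, 4.0   # illustrative sizes, not from the paper

w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]   # frozen base weight
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, small random init
b = [[0.0] * r for _ in range(d_out)]                                   # trainable, zero init

# With B = 0 the adapted layer starts out identical to the frozen base layer.
w_eff = lora_weight(w, a, b, alpha)
full_params = d_out * d_in
lora_params = trainable_params(d_out, d_in, r)
```

In this toy setting the adapters hold 28 trainable values against 48 in the full matrix; the saving grows quickly with layer width because the adapter count scales with r * (d_in + d_out) rather than d_in * d_out.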

Paper Structure

This paper contains 24 sections, 12 equations, 12 figures, 13 tables, 2 algorithms.

Figures (12)

  • Figure 1: The overall FDLLM framework, which consists of two phases. Phase 1: Constructing Dataset. Seed prompts are built and cleaned in both English and Chinese, filtered using LLMs, and checked for availability to produce a large-scale bilingual dataset from 20 LLMs. Phase 2: Learning Fingerprints. Input text is evaluated by the LoRA fine-tuned FDLLM model to extract discriminative features for fingerprint detection.
  • Figure 2: Word count distributions for Chinese and English texts generated by LLMs at different temperature settings.
  • Figure 3: The Scenario of Robustness Threat.
  • Figure 4: Confusion matrix illustrating the classification performance of FDLLM. Values less than 0.01 are not shown for better visualization.
  • Figure 5: In each panel, light-gray markers show the original embeddings produced by the frozen base/backbone model (Base Points); the dark-gray symbol (circle in a diamond in b) marks their centroid (Base Centroid). The colored poly-line traces the low-rank displacement of the centroid as training proceeds from the start ($0$) to one full epoch ($1$); numeric call-outs (0.2–1.0) indicate the fraction of the epoch completed. Blue (a) / Yellow (b) markers depict the embedding distribution at the end of the first epoch (Epoch 1 Points), while the solid dark-blue / dark-yellow symbol marks the corresponding updated centroid (Epoch 1 centroid). In both cases, a single epoch of LoRA tuning smoothly steers the centroid into the new cluster and noticeably tightens intra-class dispersion, illustrating how a low-rank update rapidly enhances linear separability.
  • ...and 7 more figures
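
The two quantities that the Figure 5 caption tracks, the class centroid and the intra-class dispersion, can be computed as in the sketch below. It assumes dispersion is measured as the mean Euclidean distance to the centroid, which is one common choice and not necessarily the paper's definition. The 2D points are toy values chosen only to make the arithmetic easy to follow; they are not embeddings from FDLLM.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

def dispersion(points):
    """Mean Euclidean distance from each point to the centroid (assumed metric)."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)

# Toy 2D "embeddings" for one source model, before and after adaptation.
base_points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
tuned_points = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]

base_centroid = centroid(base_points)
tuned_centroid = centroid(tuned_points)

# How far the centroid moved, and how much tighter the cluster became.
displacement = math.dist(base_centroid, tuned_centroid)
tightening = dispersion(base_points) - dispersion(tuned_points)
```

A positive `tightening` value corresponds to the reduced intra-class dispersion described in the caption, while `displacement` corresponds to the length of the straight line between the base centroid and the updated centroid.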