Fingerprinting LLMs via Prompt Injection

Yuepeng Hu, Zhengyuan Jiang, Mengyuan Li, Osama Ahmed, Zhicong Huang, Cheng Hong, Neil Gong

TL;DR

Tracing the provenance of released LLMs becomes challenging after post-processing, because post-training and quantization obscure lineage. LLMPrint exploits the inherent vulnerability of LLMs to prompt injection: it optimizes fingerprint prompts that bias the first-token choice between token pairs $(w_j^{+}, w_j^{-})$, yielding a discriminative, model-specific fingerprint. A unified gray-box/black-box verification framework compares reference bits from the base model to bits inferred from a suspect model, using a Gaussian-calibrated threshold $\tau = \mu + z\sigma$ with $z=1.64$. Across five base models and roughly 700 suspect variants, LLMPrint achieves high true positive rates with near-zero false positive rates, outperforming prior methods and remaining robust to post-processing and practical under API constraints.
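The verification step summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that bit $j$ is 1 when the suspect prefers $w_j^{+}$ over $w_j^{-}$ as its first token, that the comparison statistic is the fraction of matching bits, and that $\mu$ and $\sigma$ are estimated from match rates measured on models known to be unrelated to the base model. The function names (`bits_from_logprobs`, `match_rate`, `calibrated_threshold`, `is_derived`) and all numbers are hypothetical.

```python
from statistics import mean, pstdev


def bits_from_logprobs(pair_logprobs):
    """Turn first-token log-probabilities into fingerprint bits.

    pair_logprobs holds one (logp_plus, logp_minus) tuple per fingerprint
    prompt. Assumption: bit = 1 when w_j^+ is preferred over w_j^-.
    """
    return [1 if lp_plus > lp_minus else 0 for lp_plus, lp_minus in pair_logprobs]


def match_rate(reference_bits, suspect_bits):
    """Fraction of positions where suspect bits equal the reference bits."""
    if not reference_bits or len(reference_bits) != len(suspect_bits):
        raise ValueError("bit vectors must be non-empty and of equal length")
    agree = sum(r == s for r, s in zip(reference_bits, suspect_bits))
    return agree / len(reference_bits)


def calibrated_threshold(null_scores, z=1.64):
    """tau = mu + z * sigma, with mu and sigma taken from null match rates.

    Assumption: null_scores are match rates of unrelated models, treated
    as approximately Gaussian.
    """
    mu = mean(null_scores)
    sigma = pstdev(null_scores)
    return mu + z * sigma


def is_derived(reference_bits, suspect_bits, null_scores, z=1.64):
    """Flag the suspect as derived when its match rate exceeds tau."""
    return match_rate(reference_bits, suspect_bits) > calibrated_threshold(null_scores, z)


# Hypothetical toy data: 8 fingerprint bits and 4 unrelated-model match rates.
reference = [1, 0, 1, 1, 0, 1, 0, 0]
suspect = [1, 0, 1, 1, 0, 1, 0, 1]
null_rates = [0.5, 0.5, 0.625, 0.375]
verdict = is_derived(reference, suspect, null_rates)
```

Under a Gaussian null, $z=1.64$ corresponds roughly to the one-sided 95th percentile, so a suspect is flagged only when its match rate is well above what unrelated models produce. In a black-box setting where log-probabilities are unavailable, the bits would have to be inferred from sampled outputs instead; that variant is not shown here.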

Abstract

Large language models (LLMs) are often modified after release through post-processing such as post-training or quantization, which makes it challenging to determine whether one model is derived from another. Existing provenance detection methods have two main limitations: (1) they embed signals into the base model before release, which is infeasible for already published models, or (2) they compare outputs across models using hand-crafted or random prompts, which are not robust to post-processing. In this work, we propose LLMPrint, a novel detection framework that constructs fingerprints by exploiting LLMs' inherent vulnerability to prompt injection. Our key insight is that by optimizing fingerprint prompts to enforce consistent token preferences, we can obtain fingerprints that are both unique to the base model and robust to post-processing. We further develop a unified verification procedure that applies to both gray-box and black-box settings, with statistical guarantees. We evaluate LLMPrint on five base models and around 700 post-trained or quantized variants. Our results show that LLMPrint achieves high true positive rates while keeping false positive rates near zero.

Paper Structure

This paper contains 20 sections, 3 equations, 2 figures, 8 tables, 2 algorithms.

Figures (2)

  • Figure 1: Overview of LLMPrint.
  • Figure 2: Ablation studies of LLMPrint on Meta-Llama-3-8B. Results are reported on post-trained suspect models.