Your Large Language Models Are Leaving Fingerprints

Hope McGovern, Rickard Stureborg, Yoshi Suhara, Dimitris Alikaniotis

TL;DR

This work reframes machine-generated text detection as an authorship-identification problem and introduces LLM fingerprints: distinct lexical and morphosyntactic patterns left by models. By combining word, character, and POS n-gram features with a GradientBoost classifier, the authors achieve strong cross-domain and multiclass detection across five generated-text detection (GTD) datasets, often rivaling neural detectors. Fingerprints prove robust within model families and across domains, though they can be altered by targeted fine-tuning or instruction tuning and transfer less well across different model families. The findings advocate for simple, interpretable feature-based baselines as reliable detectors and offer a new lens on evaluating and understanding LLM-generated text across real-world contexts.

Abstract

It has been shown that fine-tuned transformers and other supervised detectors effectively distinguish between human and machine-generated text in some situations (arXiv:2305.13242), but we find that even simple classifiers on top of n-gram and part-of-speech features can achieve very robust performance on both in- and out-of-domain data. To understand how this is possible, we analyze machine-generated output text in five datasets, finding that LLMs possess unique fingerprints that manifest as slight differences in the frequency of certain lexical and morphosyntactic features. We show how to visualize such fingerprints, describe how they can be used to detect machine-generated text, and find that they are even robust across textual domains. We find that fingerprints are often persistent across models in the same model family (e.g. llama-13b vs. llama-65b) and that models fine-tuned for chat are easier to detect than standard language models, indicating that LLM fingerprints may be directly induced by the training data.
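As a rough illustration of the n-gram features mentioned above, the following minimal sketch turns a document into a vector of relative n-gram frequencies using only the Python standard library. The function names (`char_ngrams`, `word_ngrams`, `build_vocab`, `vectorize`), the whitespace tokenization, the top-k vocabulary cutoff, and the toy corpus are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter


def char_ngrams(text, n):
    """Overlapping character n-grams, e.g. 'abc' with n=2 gives ['ab', 'bc']."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Overlapping word n-grams over lowercased, whitespace-split tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocab(texts, extractor, n, top_k):
    """Keep the top_k most frequent n-grams across a training corpus."""
    counts = Counter()
    for text in texts:
        counts.update(extractor(text, n))
    return [gram for gram, _ in counts.most_common(top_k)]


def vectorize(text, extractor, n, vocab):
    """Relative frequency of each vocabulary n-gram within one document."""
    counts = Counter(extractor(text, n))
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(vocab)
    return [counts[gram] / total for gram in vocab]


# Toy corpus invented for illustration (not from any dataset in the paper).
train_texts = [
    "the cat sat on the mat",
    "the dog sat on the log",
]
vocab = build_vocab(train_texts, word_ngrams, n=2, top_k=3)
features = vectorize("the cat sat on the log", word_ngrams, 2, vocab)
```

Vectors like `features`, concatenated across word, character, and POS n-grams, could then be passed to any off-the-shelf classifier, for instance a gradient-boosting model; the exact feature set and hyperparameters used by the authors are not specified in this summary.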

Paper Structure

This paper contains 27 sections, 7 figures, and 15 tables.

Figures (7)

  • Figure 1: Visualization of the fingerprints. We plot frequencies of each part-of-speech (POS) class from the output of several models, sorted by model family. Within each family, the shapes (distributions) look mostly similar regardless of model size. Each radial plot is shown at the same $0\%$ to $20\%$ frequency scale, with POS tags sorted from most to least common among human-written outputs. Jagged/bumpy shapes indicate the fingerprint is more distinct from human distributions. POS is just one component of the full 'fingerprint' we investigate.
  • Figure 2: F1 score of GTD on in-domain versus out-of-domain test sets for the largest model of each model family in the Deepfake benchmark. We find no statistically significant drop in performance when testing on these 7 models' outputs. 95% confidence intervals are computed through bootstrap sampling at $n=10,000$.
  • Figure 3: Average drop in performance on various metrics when testing on out-of-domain text (blue) versus a held-out generative model (brown). Note that recall of the machine-generated text drops significantly when testing on an unseen model's output, while changing the domain has no impact on this metric.
  • Figure 4: Absolute difference in POS tag frequencies as compared with human text. Chat models are slightly more similar to the frequency profile of humans, but are easier to detect than base models. This demonstrates that a fingerprint being "closer" to the human distribution of POS tags does not make it less detectable. Further, fine-tuning models for chat clearly alters their fingerprint despite no change in model architecture.
  • Figure 5: Additional visualizations of fingerprints. Note that the POS tag distributions of OPT models are less similar to one another than those we observe within other model families. Further investigations could examine what causes these differences, since model size does not seem to be a factor in FLAN models.
  • ...and 2 more figures
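
The POS-frequency comparisons described in the captions for Figures 1 and 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the inputs are already POS-tagged, the tag sequences are invented, and the helper names `pos_profile` and `profile_gap` are hypothetical rather than the authors' code.

```python
from collections import Counter


def pos_profile(tag_sequences):
    """Relative frequency of each POS tag across a collection of tagged texts."""
    counts = Counter(tag for seq in tag_sequences for tag in seq)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: c / total for tag, c in counts.items()}


def profile_gap(model_profile, human_profile):
    """Per-tag absolute frequency difference versus human text.

    Tags are ordered from most to least common in the human profile
    (ties broken alphabetically), mirroring the ordering in Figure 1.
    """
    tags = sorted(
        set(model_profile) | set(human_profile),
        key=lambda t: (-human_profile.get(t, 0.0), t),
    )
    return [
        (t, abs(model_profile.get(t, 0.0) - human_profile.get(t, 0.0)))
        for t in tags
    ]


# Invented tag sequences; in practice they would come from a POS tagger.
human_tags = [["DET", "NOUN", "VERB", "DET", "NOUN"], ["PRON", "VERB", "ADJ"]]
model_tags = [["DET", "ADJ", "NOUN", "VERB"], ["DET", "ADJ", "NOUN", "VERB"]]

human = pos_profile(human_tags)
model = pos_profile(model_tags)
gap = profile_gap(model, human)
total_gap = sum(diff for _, diff in gap)
```

A summed gap such as `total_gap` is only a descriptive statistic. As the Figure 4 caption notes, a profile that sits closer to the human distribution is not necessarily harder to detect.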