SPOT: Text Source Prediction from Originality Score Thresholding

Edouard Yvinec, Gabriel Kasser

TL;DR

SPOT reframes trust in text as origin prediction: it uses an originality score, derived from a reference LLM's next-token predictions, to distinguish human-written from LLM-generated text. The method applies a thresholding rule, $\mathcal{O}(t) > \rho(\tilde{F})$, to classify the source of a text and is computationally efficient, requiring only a single forward pass per token. Across diverse datasets and model families, SPOT is robust to architecture, training data, domain, and compression, with large gaps between human and LLM originality on general data, though it struggles on domain-specialized tasks such as coding and mathematics when evaluated with fine-tuned models. The work highlights practical trust-based defenses against LLM-generated content, while acknowledging limitations related to scale, mixed-source texts, and deployment nuances that merit further study.

Abstract

The wide acceptance of large language models (LLMs) has unlocked new applications and social risks. Popular countermeasures aim at detecting misinformation and usually involve domain-specific models trained to recognize the relevance of any information. Instead of evaluating the validity of the information, we propose to investigate LLM-generated text from the perspective of trust. In this study, we define trust as the ability to know whether an input text was generated by an LLM or a human. To do so, we design SPOT, an efficient method that classifies the source of any standalone text input based on an originality score. This score is derived from the predictions of a given LLM, which is used to detect other LLMs. We empirically demonstrate the robustness of the method to the architecture, training data, evaluation data, task, and compression of modern LLMs.

Paper Structure

This paper contains 25 sections, 4 equations, 4 figures, 19 tables.

Figures (4)

  • Figure 1: Comparison between original text (human generated) and synthetic text (LLM generated) through the lens of the proposed SPOT method, which uses an OPT 7b model. On average, the score obtained with SPOT is 20 times larger on human sources than on LLM sources. These results were obtained on the Wikipedia training set with a common context of 24 tokens.
  • Figure 2: Density of the originality score $\mathcal{O}$ on a subset of 10000 training examples from Wikipedia, for both human text and Llama 7B text evaluated by Falcon 7B, with a context size of 24 tokens and samples 128 tokens long. The densities are estimated with Gaussian kernels of standard deviation $0.001$. We report the threshold for 95% of the scores, which highlights the clear split between human and LLM writing under the proposed metric.
  • Figure 3: Density of the originality score $\mathcal{O}$ on a subset of 10000 training examples from Wikipedia, for both human text and Llama 7B text evaluated by Falcon 7B, with a context size of 24 tokens and samples 768 tokens long.
  • Figure 4: Density of the originality score $\mathcal{O}$ on a subset of 10000 training examples from Wikipedia, for both human text and Llama 7B text evaluated by Falcon 7B, with a context size of 512 tokens and samples 768 tokens long.
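
The thresholding rule $\mathcal{O}(t) > \rho(\tilde{F})$ and the 95% threshold mentioned in the Figure 2 caption can be illustrated with a small sketch. This is not the paper's implementation: the exact definition of $\mathcal{O}$ is not given in this section, so `originality_score` below is a hypothetical stand-in (the mean of one minus the probability the reference LLM assigned to each observed token). Reading the threshold as a quantile of scores measured on known LLM text is likewise an assumption, and all function names are illustrative.

```python
import math
from typing import Sequence


def originality_score(token_probs: Sequence[float]) -> float:
    """Hypothetical stand-in for O(t), NOT the paper's definition.

    token_probs[i] is the probability the reference LLM assigned to the
    token actually observed at position i. Tokens the reference model
    finds surprising push the score up.
    """
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    return sum(1.0 - p for p in token_probs) / len(token_probs)


def quantile_threshold(llm_scores: Sequence[float], q: float = 0.95) -> float:
    """Smallest calibration score s such that at least a fraction q of the
    scores measured on known LLM text are <= s (an assumed reading of the
    threshold, not taken from the paper)."""
    if not llm_scores:
        raise ValueError("llm_scores must not be empty")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(llm_scores)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]


def predict_source(score: float, threshold: float) -> str:
    """Apply the rule O(t) > threshold: high originality is labeled human."""
    return "human" if score > threshold else "llm"


if __name__ == "__main__":
    # Made-up probabilities for illustration only; not results from the paper.
    calibration = [
        originality_score(probs)
        for probs in ([0.9, 0.95, 0.8], [0.85, 0.9, 0.99], [0.7, 0.9, 0.95])
    ]
    rho = quantile_threshold(calibration, q=0.95)
    candidate = originality_score([0.2, 0.6, 0.1])
    print(predict_source(candidate, rho))
```

In practice the per-token probabilities would come from a single forward pass of the reference model over the text, which is what keeps the method cheap; how those probabilities are aggregated into $\mathcal{O}$ should be taken from the paper's equations rather than from this sketch.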