SPOT: Text Source Prediction from Originality Score Thresholding
Edouard Yvinec, Gabriel Kasser
TL;DR
SPOT reframes trust in text as origin prediction: it uses an originality score, derived from a reference LLM's next-token predictions, to distinguish human-written from LLM-generated text. The method classifies the source of a text with a thresholding rule, $\mathcal{O}(t) > \rho(\tilde{F})$, and is computationally efficient, requiring only a single forward pass per token. Across diverse datasets and model families, SPOT is robust to architecture, training data, domain, and compression, with large gaps between human and LLM originality on general data; it struggles, however, on domain-specialized tasks such as coding and mathematics when evaluated with fine-tuned models. The work highlights practical trust-based defenses against LLM-generated content while acknowledging limitations related to scale, mixed-source texts, and deployment nuances that merit further study.
Abstract
The wide acceptance of large language models (LLMs) has unlocked new applications and social risks. Popular countermeasures aim at detecting misinformation and usually involve domain-specific models trained to recognize the relevance of any information. Instead of evaluating the validity of the information, we propose to investigate LLM-generated text from the perspective of trust. In this study, we define trust as the ability to know whether an input text was generated by an LLM or a human. To do so, we design SPOT, an efficient method that classifies the source of any standalone text input based on an originality score. This score is derived from the predictions of a given LLM and is used to detect text from other LLMs. We empirically demonstrate the robustness of the method to the architecture, training data, evaluation data, task and compression of modern LLMs.
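To make the thresholding idea concrete, the following is a minimal sketch and not the authors' implementation. The precise definition of the originality score is not given in this section, so the sketch assumes, purely for illustration, that the score is the fraction of tokens the reference LLM fails to predict as its top-1 next token. It also assumes that a score above the threshold indicates human-written text. The names `originality_score` and `classify_source` and the fixed `threshold` argument (standing in for $\rho(\tilde{F})$) are hypothetical.

```python
from typing import Sequence


def originality_score(actual: Sequence[int], predicted: Sequence[int]) -> float:
    """Illustrative originality score (an assumption, not the paper's definition).

    `actual` holds the token ids of the input text; `predicted` holds the
    reference LLM's top-1 next-token prediction at each position.
    Returns the fraction of positions where the prediction misses.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("cannot score an empty text")
    misses = sum(1 for a, p in zip(actual, predicted) if a != p)
    return misses / len(actual)


def classify_source(score: float, threshold: float) -> str:
    """Thresholding rule: a score above the threshold is labeled human.

    The direction of the rule and the scalar threshold are assumptions
    made for this sketch.
    """
    return "human" if score > threshold else "llm"


if __name__ == "__main__":
    # Toy token ids; in practice these would come from a tokenizer and a
    # forward pass of the reference LLM.
    text_tokens = [11, 42, 7, 93, 5, 18]
    model_top1 = [11, 42, 8, 93, 5, 18]
    score = originality_score(text_tokens, model_top1)
    print(score, classify_source(score, threshold=0.5))
```

In this toy setting, a text the reference model predicts almost perfectly receives a low score and is labeled as LLM-generated, while a text with many mispredicted tokens is labeled as human-written.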
