Detecting Stylistic Fingerprints of Large Language Models
Yehonatan Bitton, Elad Bitton, Shai Nisan
TL;DR
The paper addresses attribution of AI-generated text to the LLM family that produced it, rather than mere detection of AI authorship. It proposes a cost-sensitive ensemble of three diverse classifiers, trained on texts from four LLM families (Claude, Gemini, Llama, OpenAI), that accepts a prediction only when all three classifiers agree, with the aim of high precision and minimal false positives in multiclass attribution. On a 200k-text set from seen models, the unanimous ensemble achieves a macro-Fβ(0.5) of 0.9988 with a false-positive rate of 0.0004. It can also distinguish between seen and unseen LLMs, albeit with high no-agreement rates for models outside the training space. The results advance AI-generated text verification and IP protection, while highlighting limitations in fingerprint coverage. Suggested directions for future work include broader model coverage, support for more languages, and richer feature-level explainability.
Abstract
Large language models (LLMs) have distinct and consistent stylistic fingerprints, even when prompted to write in different styles. Detecting these fingerprints matters for several reasons, among them protecting intellectual property, ensuring transparency about AI-generated content, and preventing the misuse of AI technologies. In this paper, we present a novel method for classifying texts by the stylistic fingerprints of the models that generated them. We introduce an LLM-detection ensemble composed of three classifiers with varied architectures and training data. The ensemble is trained to classify texts generated by four well-known LLM families: Claude, Gemini, Llama, and OpenAI. Because the task is highly cost-sensitive and a wrong attribution might have severe implications, we aim to minimize false positives and increase confidence: a prediction is considered valid only when all three classifiers in the ensemble unanimously agree on the output class. The ensemble is validated on a test set of texts generated by Claude, Gemini, Llama, and OpenAI models, and achieves extremely high precision (0.9988) and a very low false-positive rate (0.0004). Furthermore, we demonstrate the ensemble's ability to distinguish between texts generated by seen and unseen models, which reveals interesting stylistic relationships between models. This approach to stylistic analysis has implications for verifying the originality of AI-generated texts and for tracking the origins of model training techniques.
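The unanimous-agreement rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `unanimous_vote` and `f_beta` are hypothetical, the three classifiers are represented only by their already-computed string labels, and the F-beta helper assumes the standard definition with β = 0.5, which weights precision more heavily than recall.

```python
from typing import Optional, Sequence

# The four LLM families named in the abstract.
FAMILIES = ("Claude", "Gemini", "Llama", "OpenAI")


def unanimous_vote(predictions: Sequence[str]) -> Optional[str]:
    """Return the shared label if every classifier agrees, else None.

    None stands for the "no agreement" outcome: the ensemble abstains
    rather than risk a false attribution. (Hypothetical helper.)
    """
    if not predictions:
        return None
    first = predictions[0]
    if first in FAMILIES and all(p == first for p in predictions):
        return first
    return None


def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favors precision over recall."""
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta**2) * precision * recall / denom


if __name__ == "__main__":
    # All three (hypothetical) classifiers agree: the prediction is valid.
    print(unanimous_vote(["Claude", "Claude", "Claude"]))
    # One dissenting classifier: the ensemble abstains.
    print(unanimous_vote(["Claude", "Gemini", "Claude"]))
```

Abstaining on any disagreement lowers the number of texts that receive a label, but every label that is emitted has been confirmed by three independently trained classifiers, which is the property a cost-sensitive attribution task asks for.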
