Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
Kristian Kuznetsov, Laida Kushnareva, Polina Druzhinina, Anton Razzhigaev, Anastasia Voznyuk, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
TL;DR
This work advances artificial text detection (ATD) interpretability by applying Sparse Autoencoders to Gemma-2-2B residual streams to extract sparse, interpretable features. The authors train SAE detectors per layer, aggregate token-level signals, and evaluate using XGBoost and threshold classifiers on the COLING GenAI Task 1 data and the RAID dataset, incorporating feature steering and GPT-4o-based interpretation. They find that SAE features, especially from layer 16, can outperform activation baselines and even some state-of-the-art models, while revealing generalizable and domain-specific patterns tied to complexity, repetition, and formality. The study highlights how detectors generalize to unseen models and prompt styles, offering actionable insights for designing robust, interpretable ATD systems and guiding future research on model-agnostic detection strategies.
Abstract
Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from the Gemma-2-2B residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.
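To make the pipeline summarized above more concrete, the following is a minimal sketch of its three steps: encoding residual-stream activations into sparse features, aggregating token-level features into one vector per text, and fitting a single-feature threshold classifier. It is illustrative only. The function names (`sae_encode`, `aggregate_tokens`, `fit_threshold`), the plain ReLU encoder, the mean aggregation, and the random placeholder weights are all assumptions made for this example; the SAE architecture, aggregation rule, and classifier settings actually used in the paper may differ.

```python
import numpy as np


def sae_encode(resid, W_enc, b_enc):
    """Map residual activations (n_tokens, d_model) to non-negative
    SAE features (n_tokens, n_features). Assumes a plain ReLU encoder."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)


def aggregate_tokens(token_features):
    """Collapse token-level features into one vector per text.
    Mean pooling is an assumption; other aggregations are possible."""
    return token_features.mean(axis=0)


def fit_threshold(values, labels):
    """Fit a one-feature threshold classifier by exhaustive search.

    Returns (threshold, sign, train_accuracy). With sign = 1 a text is
    predicted as generated (label 1) when value >= threshold; with
    sign = -1 it is predicted as generated when value <= threshold.
    """
    best = (0.0, 1, 0.0)
    for t in np.unique(values):
        for sign in (1, -1):
            pred = (sign * values >= sign * t).astype(int)
            acc = (pred == labels).mean()
            if acc > best[2]:
                best = (t, sign, acc)
    return best


# Toy usage with random placeholder weights and activations
# (hypothetical sizes; no real model or SAE is loaded here).
rng = np.random.default_rng(0)
d_model, n_features = 8, 16
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)

# Ten "texts" of 20 tokens each; the last five get a shifted residual
# stream to stand in for generated text.
labels = np.array([0] * 5 + [1] * 5)
texts = [rng.normal(size=(20, d_model)) + 0.5 * y for y in labels]
X = np.stack([aggregate_tokens(sae_encode(r, W_enc, b_enc)) for r in texts])

# Fit a threshold on each feature and keep the most accurate one.
fits = [fit_threshold(X[:, j], labels) for j in range(n_features)]
best_feature = int(np.argmax([f[2] for f in fits]))
```

A single-feature threshold is the simplest classifier that keeps each decision traceable to one feature; a gradient-boosted model such as XGBoost could be trained on the full matrix `X` instead, at some cost in per-feature transparency.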
