Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders

Kristian Kuznetsov, Laida Kushnareva, Polina Druzhinina, Anton Razzhigaev, Anastasia Voznyuk, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov

TL;DR

This work advances artificial text detection (ATD) interpretability by applying Sparse Autoencoders to Gemma-2-2B residual streams to extract sparse, interpretable features. The authors train SAE detectors per layer, aggregate token-level signals, and evaluate using XGBoost and threshold classifiers on the COLING GenAI Task 1 data and the RAID dataset, incorporating feature steering and GPT-4o-based interpretation. They find that SAE features, especially from layer 16, can outperform activation baselines and even some state-of-the-art models, while revealing generalizable and domain-specific patterns tied to complexity, repetition, and formality. The study highlights how detectors generalize to unseen models and prompt styles, offering actionable insights for designing robust, interpretable ATD systems and guiding future research on model-agnostic detection strategies.
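The pipeline summarized above (SAE features, token-level aggregation, a threshold classifier on a single feature) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the encoder is a generic ReLU SAE with placeholder weights, token activations are aggregated by a simple mean, and the function names (`sae_encode`, `text_features`, `best_threshold`) are hypothetical.

```python
import numpy as np

def sae_encode(resid, W_enc, b_enc):
    """Generic ReLU SAE encoder (assumed form): (tokens, d_model) -> (tokens, d_sae)."""
    return np.maximum(resid @ W_enc + b_enc, 0.0)

def text_features(resid, W_enc, b_enc):
    """Aggregate token-level SAE activations into one vector per text.

    Mean over tokens is an assumption; other aggregations are possible.
    """
    return sae_encode(resid, W_enc, b_enc).mean(axis=0)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for binary labels 0/1."""
    scores = []
    for cls in (0, 1):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def best_threshold(values, labels):
    """Scan candidate thresholds on one feature and keep the best macro F1.

    A text is predicted as machine-generated (label 1) when its feature
    value is strictly greater than the threshold.
    """
    best_t, best_f1 = None, -1.0
    for t in np.unique(values):
        f1 = macro_f1(labels, (values > t).astype(int))
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1

# Toy usage with random placeholder weights (not a trained SAE).
rng = np.random.default_rng(0)
W_enc, b_enc = rng.normal(size=(4, 8)), np.zeros(8)
feats = text_features(rng.normal(size=(5, 4)), W_enc, b_enc)  # one text, 5 tokens
threshold, score = best_threshold(np.array([0.1, 0.2, 0.8, 0.9]),
                                  np.array([0, 0, 1, 1]))
```

The threshold scan only tests the "greater than" direction; a feature that fires more strongly on human-written text would need the comparison reversed. A gradient-boosted classifier over the full aggregated feature vector, as the summary mentions, would replace `best_threshold` in that setting.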

Abstract

Artificial Text Detection (ATD) is becoming increasingly important with the rise of advanced Large Language Models (LLMs). Despite numerous efforts, no single algorithm performs consistently well across different types of unseen text or guarantees effective generalization to new LLMs. Interpretability plays a crucial role in achieving this goal. In this study, we enhance ATD interpretability by using Sparse Autoencoders (SAE) to extract features from the Gemma-2-2b residual stream. We identify both interpretable and efficient features, analyzing their semantics and relevance through domain- and model-specific statistics, a steering approach, and manual or LLM-based interpretation. Our methods offer valuable insights into how texts from various models differ from human-written content. We show that modern LLMs have a distinct writing style, especially in information-dense domains, even though they can produce human-like outputs with personalized prompts.

Paper Structure

This paper contains 18 sections, 4 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: Interpretations of one of the most "universal" SAE features that are useful for the ATD task.
  • Figure 2: Macro F1 for XGBoost model on activations and SAE-derived features on different subsets of COLING
  • Figure 3: Macro F1 for a threshold classifier on individual features across each model for the 16th layer. Max F1 shows the maximum F1 score for every feature; features 3608 and 4645 are considered general features
  • Figure 4: Macro F1 by domain subset for some general and domain-specific features for the 16th layer
  • Figure 5: Machine-generated text samples, various models, anomalous punctuation
  • ...and 12 more figures