Robust AI-Generated Text Detection by Restricted Embeddings

Kristian Kuznetsov, Eduard Tulchinskii, Laida Kushnareva, German Magai, Serguei Barannikov, Sergey Nikolenko, Irina Piontkovskaya

TL;DR

This work addresses the robustness of artificial text detection (ATD) under cross-domain and cross-generator shift by restricting detectors to residual subspaces of Transformer embeddings. It studies three residual-subspace strategies (coordinate-based embedding removal, head-wise decomposition, and LEACE-based concept erasure) that suppress domain-specific cues while preserving features that generalize. Experiments on SemEval-2024 and GPT-3D show consistent cross-domain and cross-model improvements: head-wise and coordinate-based removal give the largest gains (up to +9% for RoBERTa and +14% for BERT embeddings in particular setups), while concept erasure gives targeted benefits by erasing syntax-related features. The results indicate that simple, interpretable subspace removal can outperform PCA-based or more complex approaches, which suggests a practical path to more robust ATD as generators and domains evolve. The authors also note limitations related to watermarking, model evolution, and the need for interpretability, and they release code and data for replication and further study.

Abstract

The growing amount and quality of AI-generated text make detecting such content more difficult. In most real-world scenarios, the domain (style and topic) of the generated data and the generator model are not known in advance. In this work, we focus on the robustness of classifier-based detectors of AI-generated text, namely their ability to transfer to unseen generators or semantic domains. We investigate the geometry of the embedding space of Transformer-based text encoders and show that clearing out harmful linear subspaces helps to train a robust classifier that ignores domain-specific spurious features. We investigate several subspace decomposition and feature selection strategies and achieve significant improvements over state-of-the-art methods in cross-domain and cross-generator transfer. Our best approaches for head-wise and coordinate-based subspace removal increase the mean out-of-distribution (OOD) classification score by up to 9% and 14% in particular setups for RoBERTa and BERT embeddings, respectively. We release our code and data: https://github.com/SilverSolver/RobustATD
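
The idea of "clearing out" a linear subspace before training a classifier can be illustrated in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' released implementation: the function names `remove_coordinates` and `project_out` are hypothetical, the embeddings are random stand-ins for encoder outputs, and the question of how harmful coordinates or directions are selected is left out entirely.

```python
import numpy as np


def remove_coordinates(X, drop_idx):
    """Drop the given embedding coordinates (columns) from X.

    Hypothetical helper: keeps every column whose index is not in drop_idx.
    """
    keep = np.setdiff1d(np.arange(X.shape[1]), drop_idx)
    return X[:, keep]


def project_out(X, directions):
    """Project rows of X onto the orthogonal complement of span(directions).

    Hypothetical helper: `directions` has shape (m, d), one direction per row.
    """
    # Orthonormalize the directions; Q has shape (d, m) with orthonormal columns.
    Q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    # Subtract the component of each row that lies inside the removed subspace.
    return X - (X @ Q) @ Q.T


# Random stand-ins for sentence embeddings: 8 texts, 16 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))

# Coordinate-based restriction: drop two (arbitrarily chosen) coordinates.
X_coord = remove_coordinates(X, [0, 5])

# Subspace-based restriction: remove a 3-dimensional (arbitrary) subspace.
V = rng.normal(size=(3, 16))
X_res = project_out(X, V)
```

After `project_out`, every row of `X_res` is orthogonal to each removed direction, so a linear classifier trained on `X_res` cannot use those directions. Coordinate removal is the special case where the removed directions are standard basis vectors, except that it also shrinks the embedding dimension.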

Paper Structure

This paper contains 26 sections, 1 theorem, 19 equations, 14 figures, 10 tables.

Key Result

Proposition 1

Let $\{\mathbf{u}_1,\dots, \mathbf{u}_d\}$ be the principal components of a dataset $\mathcal{D}$ with corresponding singular values $\lambda_1, \dots, \lambda_d$ (in descending order). Then the explained variance of the subspace spanned by the last $d-k$ principal components, $R_k=\langle \mathbf{u}_{k+1},\dots,\mathbf{u}_d\rangle$, equals [...]. Moreover, $R_k$ has the minimal explained variance among all $(d-k)$-dimensional subspaces.
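
The minimality claim can be checked numerically. The sketch below assumes the standard notion of explained variance, namely the fraction of the total variance of the centered data that survives orthogonal projection onto the subspace. Definition 1 is not reproduced in this section, so that choice is an assumption, as are the synthetic data and the helper name `explained_variance`. Under this assumption the explained variance of $R_k$ is the ratio of the trailing squared singular values to all squared singular values, and no randomly drawn $(d-k)$-dimensional subspace should fall below it.

```python
import numpy as np


def explained_variance(X, basis):
    """Fraction of the variance of X retained after projecting onto span(basis).

    Assumed definition (standard, not taken from the paper's Definition 1).
    `basis` is a (d, m) matrix with orthonormal columns.
    """
    Xc = X - X.mean(axis=0)
    return np.sum((Xc @ basis) ** 2) / np.sum(Xc ** 2)


rng = np.random.default_rng(0)
n, d, k = 500, 10, 4

# Synthetic anisotropic data: each coordinate has a different scale.
X = rng.normal(size=(n, d)) * np.linspace(3.0, 0.5, d)

# PCA via SVD of the centered data; singular values come out in descending order.
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# R_k: the subspace spanned by the last d-k principal components.
R_k = Vt[k:].T
ev_tail = explained_variance(X, R_k)

# Closed form under the assumed definition: trailing squared singular values.
ev_formula = np.sum(s[k:] ** 2) / np.sum(s ** 2)

# Compare against random (d-k)-dimensional subspaces.
ev_random = []
for _ in range(200):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d - k)))
    ev_random.append(explained_variance(X, Q))

print(f"tail subspace: {ev_tail:.4f}, best random subspace: {min(ev_random):.4f}")
```

This is why removing the top principal components is the natural baseline for residual-subspace methods: the remaining subspace $R_k$ retains as little variance as any subspace of its dimension can.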

Figures (14)

  • Figure 1: Mean accuracy in cross-domain (left) and cross-model (right) ATD by RoBERTa-base on SemEval
  • Figure 2: Mean accuracy on SemEval with pruned RoBERTa layers. Dashed lines show the baseline.
  • Figure 3: Mean accuracy in cross-domain/cross-model ATD on GPT-3D by: (a) RoBERTa-base, (b) RoBERTa-base with all attention heads pruned from layer 1, (c) RoBERTa with TopConst concept erasure, (d) optimal head removal, (e) best set of coordinates, (f) classifier based on PHD intrinsic dimensions.
  • Figure 4: Score change after concept erasure in cross-domain and cross-model settings on SemEval.
  • Figure 5: Geometric intuition of our approaches.
  • ...and 9 more figures

Theorems & Definitions (4)

  • Definition 1: Subspace explained variance [shen2008sparse, gandelsman2023interpreting]
  • Definition 2
  • Proposition 1
  • Proof