Robust AI-Generated Text Detection by Restricted Embeddings
Kristian Kuznetsov, Eduard Tulchinskii, Laida Kushnareva, German Magai, Serguei Barannikov, Sergey Nikolenko, Irina Piontkovskaya
TL;DR
This work tackles the robustness of artificial text detection (ATD) under cross-domain and cross-generator shifts by restricting detectors to residual subspaces of Transformer embeddings. It introduces three residual-subspace strategies (coordinate-based embedding removal, head-wise decomposition, and LEACE-based concept erasure) that suppress domain-specific cues while preserving generalizable features. Empirical results on SemEval-2024 and GPT-3D show consistent cross-domain and cross-generator improvements: head-wise and coordinate-based removal deliver the largest gains, raising the mean out-of-distribution score by up to 9% for RoBERTa and 14% for BERT in particular setups, while concept erasure provides targeted benefits by erasing syntax-related features. The findings suggest that simple, interpretable subspace removals can outperform PCA-based or more complex approaches, offering a practical path to more robust ATD as generators and domains evolve. The authors also note limitations related to watermarking, model evolution, and the need for interpretability, and they provide code and data to facilitate replication and further study.
Abstract
The growing amount and quality of AI-generated texts make detecting such content more difficult. In most real-world scenarios, the domain (style and topic) of generated data and the generator model are not known in advance. In this work, we focus on the robustness of classifier-based detectors of AI-generated text, namely their ability to transfer to unseen generators or semantic domains. We investigate the geometry of the embedding space of Transformer-based text encoders and show that clearing out harmful linear subspaces helps to train a robust classifier that ignores domain-specific spurious features. We investigate several subspace decomposition and feature selection strategies and achieve significant improvements over state-of-the-art methods in cross-domain and cross-generator transfer. Our best approaches for head-wise and coordinate-based subspace removal increase the mean out-of-distribution (OOD) classification score by up to 9% and 14% in particular setups for RoBERTa and BERT embeddings, respectively. We release our code and data: https://github.com/SilverSolver/RobustATD
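To make the idea of "clearing out" a linear subspace concrete, here is a minimal NumPy sketch of two generic operations: dropping selected embedding coordinates, and projecting embeddings onto the orthogonal complement of a set of directions. It is an illustration under stated assumptions, not the released implementation: the function names `remove_coordinates` and `project_out_subspace` are hypothetical, the embeddings are random stand-ins for encoder outputs, and the choice of which coordinates or directions count as harmful is not shown here.

```python
import numpy as np


def remove_coordinates(embeddings, drop_idx):
    """Return embeddings with the listed coordinates (columns) removed."""
    keep = np.setdiff1d(np.arange(embeddings.shape[1]), drop_idx)
    return embeddings[:, keep]


def project_out_subspace(embeddings, directions):
    """Project embeddings onto the orthogonal complement of span(directions).

    `directions` has shape (k, d); its rows are assumed linearly independent.
    """
    # Orthonormal basis of the subspace, shape (d, k), via reduced QR.
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    # Subtract each embedding's component that lies inside the subspace.
    return embeddings - (embeddings @ q) @ q.T


# Hypothetical data: 8 texts with 16-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))

# Placeholder choices; a real pipeline would select these on held-out data.
X_coord = remove_coordinates(X, drop_idx=[3, 7])
harmful = rng.normal(size=(2, 16))
X_proj = project_out_subspace(X, harmful)

# After projection, embeddings carry no component along the removed directions.
residual = np.abs(X_proj @ harmful.T).max()
```

A downstream classifier would then be trained on `X_coord` or `X_proj` instead of `X`. Note that the projection keeps the embedding dimension but reduces its rank, while coordinate removal shrinks the feature vector itself.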
