Breaking the Adversarial Robustness-Performance Trade-off in Text Classification via Manifold Purification
Chenhao Dang, Jing Ma
TL;DR
This work tackles the robustness-accuracy dilemma in text classification by exploiting the embedding-space geometry of pre-trained language models (PLMs). It introduces MC^2F, a two-module framework: an SR-CNF first learns a stratified Riemannian manifold of clean embeddings and uses it to detect adversarial inputs, and a Geodesic Purification Solver then projects the detected embeddings along geodesics onto the clean manifold. The training objective combines density estimation, topological preservation, and causal-semantic regularization so that purification is robust and semantically faithful. Across SST-2, AGNews, and YELP, MC^2F achieves state-of-the-art adversarial robustness while maintaining or even improving performance on clean data, illustrating the practical value of a geometry-guided defense. The approach offers a principled path to deploying NLP systems that perform reliably in adversarial settings.
Abstract
A persistent challenge in text classification (TC) is that enhancing model robustness against adversarial attacks typically degrades performance on clean data. We argue that this challenge can be resolved by modeling the distribution of clean samples in the encoder embedding manifold. To this end, we propose the Manifold-Correcting Causal Flow (MC^2F), a two-module system that operates directly on sentence embeddings. The first module, a Stratified Riemannian Continuous Normalizing Flow (SR-CNF), learns the density of the clean data manifold and uses it to identify out-of-distribution embeddings. The second module, a Geodesic Purification Solver, corrects these embeddings by projecting adversarial points back onto the learned manifold via the shortest path, restoring a clean, semantically coherent representation. We evaluate MC^2F extensively on TC across three datasets and multiple adversarial attacks. The results demonstrate that MC^2F not only establishes a new state of the art in adversarial robustness but also fully preserves performance on clean data, even yielding modest gains in accuracy.
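
The detect-then-purify control flow described in the abstract can be illustrated with a deliberately simplified sketch. It does not implement the SR-CNF or the Geodesic Purification Solver. As an assumption made purely for illustration, a diagonal Gaussian stands in for the learned density, a standardized distance threshold stands in for the out-of-distribution test, and a radial pull-back toward the clean mean stands in for the geodesic projection. Under this stand-in metric the radial line is the shortest path to the accepted region. All function names and the `quantile` parameter are hypothetical.

```python
import math
from statistics import fmean, pstdev


def distance(x, mean, std):
    """Standardized (diagonal Mahalanobis) distance from the clean mean."""
    return math.sqrt(sum(((xi - m) / s) ** 2 for xi, m, s in zip(x, mean, std)))


def fit_clean_model(clean, quantile=0.99):
    """Fit a diagonal Gaussian to clean embeddings and choose a detection radius.

    Stand-in for density learning: the radius is the given quantile of the
    distances observed on the clean embeddings themselves.
    """
    columns = list(zip(*clean))
    mean = [fmean(col) for col in columns]
    # Guard against zero variance so the distance stays finite.
    std = [max(pstdev(col), 1e-8) for col in columns]
    dists = sorted(distance(x, mean, std) for x in clean)
    radius = dists[min(len(dists) - 1, int(quantile * len(dists)))]
    return mean, std, radius


def is_out_of_distribution(x, mean, std, radius):
    """Detection step: flag embeddings that fall outside the clean region."""
    return distance(x, mean, std) > radius


def purify(x, mean, std, radius):
    """Purification step: pull a flagged embedding back to the region boundary.

    Embeddings already inside the region are returned unchanged, which is
    what lets a defense of this shape leave clean inputs untouched.
    """
    d = distance(x, mean, std)
    if d <= radius:
        return list(x)
    scale = radius / d
    return [m + scale * (xi - m) for xi, m in zip(x, mean)]
```

In use, `fit_clean_model` would run once on embeddings of clean training sentences, and each incoming embedding would pass through `is_out_of_distribution` and, if flagged, through `purify` before reaching the classifier.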
