Explaining the role of Intrinsic Dimensionality in Adversarial Training

Enes Altinisik, Safa Messaoud, Husrev Taha Sencar, Hassan Sajjad, Sanjay Chawla

TL;DR

This work explains why adversarial training yields different robustness and generalization outcomes across vision models, encoder-based LLMs, and decoder-based LLMs by linking these trends to layerwise intrinsic dimensionality (ID) and the manifold conjecture. It shows that off-manifold adversarial examples (OFM-AEs) promote robustness, while on-manifold adversarial examples (ONM-AEs) promote generalization, with the layer-specific ID shaping the ONM/OFM mix. Building on this, the authors introduce SMAAT, which perturbs the layer with the lowest ID to generate OFM-AEs efficiently, reducing the PGD chain length and achieving faster training while boosting robustness across sentiment classification, safety filtering, and RAG retrieval tasks. Empirical results demonstrate that SMAAT delivers superior robustness with comparable generalization to standard training and offers substantial runtime savings, making AT more practical for encoder-based models in real-world pipelines. The findings offer a principled path to balance robustness and generalization by controlling perturbations across intermediate layers, with potential for broader adoption in diverse AI systems.

Abstract

Adversarial Training (AT) impacts different architectures in distinct ways: vision models gain robustness but face reduced generalization, encoder-based models exhibit limited robustness improvements with minimal generalization loss, and recent work in latent-space adversarial training (LAT) demonstrates that decoder-based models achieve improved robustness by applying AT across multiple layers. We provide the first explanation for these trends by leveraging the manifold conjecture: off-manifold adversarial examples (AEs) enhance robustness, while on-manifold AEs improve generalization. We show that vision and decoder-based models exhibit low intrinsic dimensionality in earlier layers (favoring off-manifold AEs), whereas encoder-based models do so in later layers (favoring on-manifold AEs). Exploiting this property, we introduce SMAAT, which improves the scalability of AT for encoder-based models by perturbing the layer with the lowest intrinsic dimensionality. This reduces the projected gradient descent (PGD) chain length required for AE generation, cutting GPU time by 25-33% while significantly boosting robustness. We validate SMAAT across multiple tasks, including text generation, sentiment classification, safety filtering, and retrieval augmented generation setups, demonstrating superior robustness with comparable generalization to standard training.
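
The abstract's argument depends on measuring intrinsic dimensionality layer by layer. As a rough illustration only, the sketch below estimates ID with the maximum-likelihood form of the TwoNN estimator (the figure captions name twoNN as the method, but the exact variant and preprocessing the authors use are not stated here, so treat those as assumptions). The function name `twonn_id` and the synthetic data are hypothetical; in practice `points` would hold one layer's hidden representations, one row per example.

```python
import numpy as np

def twonn_id(points: np.ndarray) -> float:
    """Maximum-likelihood TwoNN estimate of intrinsic dimensionality.

    Assumption: the MLE form d = N / sum(log(r2 / r1)), where r1 and r2
    are each point's distances to its first and second nearest neighbours.
    """
    # Pairwise squared Euclidean distances via the Gram matrix.
    sq_norms = np.sum(points ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    d2 = np.maximum(d2, 0.0)          # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)      # a point is not its own neighbour

    # The two smallest distances in each row.
    nearest = np.sqrt(np.partition(d2, 1, axis=1)[:, :2])
    r1, r2 = nearest[:, 0], nearest[:, 1]

    keep = r1 > 0                     # drop exact duplicate points
    mu = r2[keep] / r1[keep]
    return float(np.sum(keep) / np.sum(np.log(mu)))

# Synthetic check: a 2-D Gaussian cloud linearly embedded in 10-D,
# compared with a cloud that fills all 5 of its ambient dimensions.
rng = np.random.default_rng(0)
low_dim_data = rng.normal(size=(1500, 2)) @ rng.normal(size=(2, 10))
full_dim_data = rng.normal(size=(1500, 5))

id_low = twonn_id(low_dim_data)
id_full = twonn_id(full_dim_data)
print(f"embedded 2-D data: ID ~ {id_low:.2f}")
print(f"full 5-D data:     ID ~ {id_full:.2f}")
```

Applied once per layer, an estimator like this would give the layerwise ID profile from which a lowest-ID layer could be chosen for perturbation. The dense distance matrix is quadratic in the number of points, so a nearest-neighbour index would be preferable for large sample sizes.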

Paper Structure

This paper contains 16 sections, 9 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Impact of applying LAT at different layers of the LLaMA-2 model, illustrating the relationship between Intrinsic Dimensionality (background color), Generalization (blue), Robustness (orange), and Off-Manifold Ratio (green, based on reconstruction error). Markers show the average of measured values across multiple training configurations; lines depict overall trends. The off-manifold ratio measures the percentage of adversarial examples that fall outside the data manifold using reconstruction error. As we move to deeper layers, the Intrinsic Dimensionality increases, resulting in a decrease in the off-manifold ratio. According to the manifold conjecture, this leads to an increase in generalization (more on-manifold samples) and a decrease in robustness.
  • Figure 2: Comparison of SMAAT robustness (x-axis), generalization (y-axis), and run time (marker size) against baselines for robustifying (a) topic classifiers, (b) retriever models in the Retrieval Augmented Generation (RAG) setup, and (c) safety filters for decoder-based LLMs. SMAAT significantly enhances model robustness compared to seven different baselines while maintaining nearly the same clean accuracy. It is also significantly more scalable than AT (marker size).
  • Figure 3: Left: In classical AT, adversarial examples (AEs) are created in the data layer. For encoder LLMs, the intrinsic dimensionality tends to be high in the initial layers, so the AEs created tend to be on-manifold, which results in better generalization. In vision models and decoder LLMs, we observe the opposite behavior: AEs tend to be off-manifold, resulting in better robustness. Right: The key idea of SMAAT is to create AEs in intermediate layers where the intrinsic dimensionality is low and AEs tend to be off-manifold. This results in better robustness while (surprisingly) maintaining generalization. The speed-up in SMAAT comes from the shorter backprop chains needed to create AEs in intermediate layers.
  • Figure 4: The ID trend (row 1) is the inverse of the OFM-/ONM-AE ratio trend (row 2). The average projection error ($e_l^k$) is used as a proxy for estimating the OFM-/ONM-AE ratio. The ID is computed using the twoNN approach. Enc-LLMs (BERT, RoBERTa) show decreasing ID and OFM-AE proportion trends, unlike vision models and dec-LLMs, which show increasing ID and ONM-AE trends.
  • Figure 5: Layer-wise effects of adversarial training on robustness and generalization across model types. Each subplot shows the impact of applying AT at different layers of (a) LLaMA-2 (dec-LLM), (b) BERT (enc-LLM), and (c) VGG (vision model). Marker colors transition from light blue (lower layers) to dark blue (higher layers). Observed trends align with changes in intrinsic dimensionality and the distribution of on- vs. off-manifold adversarial examples, as shown in Figure \ref{fig:k_effect}.
  • ...and 1 more figure
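
The captions for Figures 1 and 4 describe an off-manifold ratio estimated from reconstruction (projection) error. One way such a proxy could be computed is sketched below. It rests on assumptions not confirmed by the text above: the manifold is approximated by a rank-`k` PCA subspace fitted to clean representations, and a point is counted as off-manifold when its projection error exceeds a quantile of the clean errors. All function names and the `quantile` threshold are hypothetical.

```python
import numpy as np

def fit_subspace(clean: np.ndarray, k: int):
    """Return the mean and top-k principal directions of clean data."""
    mean = clean.mean(axis=0)
    # Rows of vt are principal directions, ordered by singular value.
    _, _, vt = np.linalg.svd(clean - mean, full_matrices=False)
    return mean, vt[:k]

def projection_error(x: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Norm of the component of each row that lies outside the subspace."""
    centered = x - mean
    reconstructed = centered @ basis.T @ basis
    return np.linalg.norm(centered - reconstructed, axis=1)

def off_manifold_ratio(clean, perturbed, k, quantile=0.95):
    """Fraction of perturbed points whose projection error exceeds the
    chosen quantile of clean projection errors (threshold is an assumption)."""
    mean, basis = fit_subspace(clean, k)
    threshold = np.quantile(projection_error(clean, mean, basis), quantile)
    return float(np.mean(projection_error(perturbed, mean, basis) > threshold))

# Synthetic check: clean data near a 3-D subspace of a 20-D space.
rng = np.random.default_rng(0)
embed = rng.normal(size=(3, 20))
clean = rng.normal(size=(500, 3)) @ embed + 0.01 * rng.normal(size=(500, 20))

# Perturbations inside the subspace versus in arbitrary ambient directions.
on_manifold = clean + 0.5 * rng.normal(size=(500, 3)) @ embed
off_manifold = clean + 0.5 * rng.normal(size=(500, 20))

ratio_on = off_manifold_ratio(clean, on_manifold, k=3)
ratio_off = off_manifold_ratio(clean, off_manifold, k=3)
print(f"in-subspace perturbations: ratio = {ratio_on:.2f}")
print(f"ambient perturbations:     ratio = {ratio_off:.2f}")
```

A linear subspace is only a crude stand-in for a data manifold. It is used here because it keeps the projection error easy to compute and interpret.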