Efficient Semi-Supervised Adversarial Training via Latent Clustering-Based Data Reduction

Somrita Ghosh, Yuelin Xu, Xiao Zhang

TL;DR

Efficient SSAT reduces unlabeled data and training time by prioritizing boundary-adjacent samples through latent-space clustering and guided diffusion. It introduces LCS-KM, LCS-GMM, and diffusion-guided variants (PCG, LCG-KM, LCG-GMM) to maintain robustness with far fewer unlabeled samples. Empirical results on SVHN, CIFAR-10, and medical data show robust accuracy close to full-data baselines with 5–10x unlabeled data reductions and 4x faster runtimes. The approach demonstrates practical scalability and provides a foundation for more data-efficient robust learning.

Abstract

Achieving high model robustness under adversarial settings is widely recognized as demanding considerable training samples. Recent works propose semi-supervised adversarial training (SSAT) methods with external unlabeled or synthetically generated data, which are the current state-of-the-art. However, SSAT requires substantial extra data to attain high robustness, resulting in prolonged training time and increased memory usage. In this paper, we propose unlabeled data reduction strategies to improve the efficiency of SSAT. Specifically, we design novel latent clustering-based techniques to select or generate a small critical subset of data samples near the model's decision boundary. While focusing on boundary-adjacent points, our methods maintain a balanced ratio between boundary and non-boundary data points to avoid overfitting. Comprehensive experiments on benchmark datasets demonstrate that our methods can significantly reduce SSAT's data requirement and computation costs while preserving its strong robustness advantages. In particular, our latent-space selection scheme based on k-means clustering and our guided DDPM fine-tuning approach with LCG-KM are the most effective, achieving nearly identical robust accuracies with 5x to 10x less unlabeled data and approximately 4x less total runtime.
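The selection idea in the abstract (cluster the unlabeled pool in latent space, favor boundary-adjacent points, and keep a balanced share of non-boundary points) can be illustrated with a minimal sketch. This is not the authors' implementation: the margin score (distance to the second-nearest centroid minus distance to the nearest), the small NumPy k-means, and the names `select_boundary_subset` and `beta` are assumptions made for illustration. Only the reduction ratio $\alpha$ appears in the source.

```python
import numpy as np


def kmeans(latents, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means in NumPy; returns the (k, d) centroid matrix."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct data points (fancy indexing copies).
    centroids = latents[rng.choice(len(latents), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(latents[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = latents[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def select_boundary_subset(latents, k, alpha, beta, seed=0):
    """Pick round(alpha * N) indices from an unlabeled pool of latent vectors.

    alpha: fraction of the pool to keep (the reduction ratio from the text).
    beta:  hypothetical fraction of the kept points taken from the boundary;
           the rest are sampled at random from the remaining pool.
    """
    rng = np.random.default_rng(seed)
    centroids = kmeans(latents, k, seed=seed)
    dists = np.linalg.norm(latents[:, None, :] - centroids[None, :, :], axis=2)
    two_nearest = np.sort(dists, axis=1)[:, :2]
    # Small margin = almost equidistant from two cluster centers = near a boundary.
    margin = two_nearest[:, 1] - two_nearest[:, 0]

    n_keep = int(round(alpha * len(latents)))
    n_boundary = int(round(beta * n_keep))
    order = np.argsort(margin)
    boundary_idx = order[:n_boundary]
    non_boundary_idx = rng.choice(order[n_boundary:], size=n_keep - n_boundary,
                                  replace=False)
    return np.concatenate([boundary_idx, non_boundary_idx])


# Toy usage: 200 random 4-dimensional "latent" vectors, keep 10% of them.
pool = np.random.default_rng(1).normal(size=(200, 4))
selected = select_boundary_subset(pool, k=3, alpha=0.1, beta=0.5)
```

In practice the latent vectors would come from an intermediate layer of a trained classifier, and the selected indices would define the reduced unlabeled set passed to SSAT. Mixing in randomly sampled non-boundary points mirrors the balanced ratio the abstract describes.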

Paper Structure (31 sections, 16 equations, 10 figures, 6 tables, 3 algorithms)

Figures (10)

  • Figure 1: Illustration of standard and robust accuracy curves of SSAT on CIFAR-10 with different configurations of unlabeled data selection from the external $500$K Tiny Images: (a) No extra data, (b) random selection with $\alpha = 10\%$, (c) LCS-KM with $\alpha = 10\%$, and (d) utilizing all $500$K unlabeled data.
  • Figure 2: Standard and robust accuracy curves of SSAT with labeled data from COVIDGR and unlabeled data from CoronaHack, with different selection schemes and ratios: (a) random selection with $\alpha = 10\%$, (b) LCS-KM with $\alpha = 10\%$, (c) LCS-KM with $\alpha = 20\%$, and (d) all unlabeled data $\alpha = 100\%$.
  • Figure 3: Visual comparison of selection techniques on the Tiny Images dataset in the latent space. Each subplot represents a different method: (a) PCS identifies points with the lowest classification confidence, highlighting areas where the model is most uncertain within the ten-class latent representation, (b) LCS-GMM illustrates probability contours from Gaussian Mixture Models, with selected points emphasizing regions of overlapping probabilities among the ten class clusters, and (c) LCS-KM highlights points selected near decision boundaries across ten classes based on k-means clustering in the latent space.
  • Figure 4: Illustration of SSAT performance using our LCS-KM method on image benchmarks with varying boundary and unlabeled data reduction ratio parameters with respect to: (a) and (c) DDPM-generated CIFAR-10 data with $\alpha = 20\%$, (b) and (d) $531$K extra SVHN data with $\alpha = 10\%$. For each figure, we vary the considered ratio parameter while keeping all the remaining hyperparameters fixed.
  • Figure 5: Clean and robust accuracy curves of SSAT with LCG-KM using generated data on CIFAR-10 under $\alpha = 10\%$ by varying: (a) the number of fine-tuning epochs $S$, and (b) the regularization strength parameter $\lambda$.
  • ...and 5 more figures
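
The Figure 3 caption contrasts two probability-based selection criteria: PCS picks points with the lowest classification confidence, while LCS-GMM emphasizes regions where class-cluster probabilities overlap. A minimal sketch of how two such scores could differ is shown below. The function names and the "gap between the two largest posteriors" overlap score are assumptions for illustration, not the scoring rules defined in the paper.

```python
import numpy as np


def lowest_confidence_indices(probs, n_select):
    """PCS-style score (assumed): rank by the maximum class probability."""
    confidence = probs.max(axis=1)
    return np.argsort(confidence, kind="stable")[:n_select]


def smallest_posterior_gap_indices(posteriors, n_select):
    """Overlap-style score (assumed): rank by the gap between the two
    largest posterior probabilities, e.g. GMM component responsibilities."""
    top_two = np.sort(posteriors, axis=1)[:, -2:]
    gap = top_two[:, 1] - top_two[:, 0]
    return np.argsort(gap, kind="stable")[:n_select]


# Three points over three classes; rows sum to 1.
p = np.array([
    [0.5, 0.5, 0.0],   # two classes tie: strongest overlap
    [0.4, 0.3, 0.3],   # lowest top probability: least confident
    [0.8, 0.1, 0.1],   # confident and well separated
])

by_confidence = lowest_confidence_indices(p, 2)   # -> [1, 0]
by_overlap = smallest_posterior_gap_indices(p, 2)  # -> [0, 1]
```

The two rankings disagree on which point comes first: the confidence score prefers the diffuse row, while the gap score prefers the exact two-way tie. This is one way to read the difference between panels (a) and (b) of Figure 3.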