Generalizing Few Data to Unseen Domains Flexibly Based on Label Smoothing Integrated with Distributionally Robust Optimization

Yangdi Wang, Zhi-Hai Zhang, Su Xiu Xu, Wenming Guo

TL;DR

This work targets overfitting of deep nets trained on small datasets by merging label smoothing with distributionally robust optimization (DRO). It formulates a DRO-LS framework in which LS-induced label regularization couples with a Wasserstein-based ambiguity set to flexibly shift data toward unseen domains, effectively augmenting data without extra annotations. The proposed GI-LS algorithm solves this two-stage problem via inner gradient ascent for generating perturbed samples and outer SGD for training, with convergence guarantees and Bayesian optimization-guided hyperparameter tuning. Empirical results on small-scale anomaly datasets show that GI-LS outperforms standard LS and conventional data-augmentation methods, with robustness to perturbations in the DRO parameter $\gamma$ and strong performance gains on MT-Defect/MT as well as stable gains on Wood/Carpet. The approach offers a principled, scalable path to improve generalization under data scarcity, with practical implications for domains like industrial anomaly detection where data collection is costly.
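The two-stage structure described above (inner gradient ascent to generate perturbed samples, outer descent to train) can be illustrated with a minimal sketch. It is not the authors' GI-LS implementation. It assumes a linear softmax classifier in place of a DNN, a squared Euclidean transport cost, full-batch gradient descent in place of SGD, and hypothetical names (`perturb`, `train`, `inner_step`, and so on) and hyperparameter values chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def smooth(one_hot, alpha):
    # Label smoothing: mix one-hot targets with the uniform vector.
    return (1.0 - alpha) * one_hot + alpha / one_hot.shape[1]

def perturb(W, x0, targets, gamma, step, n_steps):
    # Inner loop (assumed form): gradient ascent on
    #   loss(x) - gamma * ||x - x0||^2
    x = x0.copy()
    for _ in range(n_steps):
        probs = softmax(x @ W)
        grad_loss = (probs - targets) @ W.T    # d(cross-entropy)/dx per sample
        grad_penalty = 2.0 * gamma * (x - x0)  # pulls x back toward x0
        x = x + step * (grad_loss - grad_penalty)
    return x

def train(x0, y, num_classes, alpha=0.1, gamma=1.0, lr=0.5,
          epochs=200, inner_steps=5, inner_step=0.1, seed=0):
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((x0.shape[1], num_classes))
    targets = smooth(np.eye(num_classes)[y], alpha)
    for _ in range(epochs):
        # Generate shifted samples, then take one descent step on them.
        x_new = perturb(W, x0, targets, gamma, inner_step, inner_steps)
        probs = softmax(x_new @ W)
        W -= lr * x_new.T @ (probs - targets) / len(x0)
    return W

# Toy data: two well-separated Gaussian blobs (illustrative only).
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(-2.0, 0.5, (20, 2)), rng.normal(2.0, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
W = train(x, y, num_classes=2)
accuracy = float(np.mean(softmax(x @ W).argmax(axis=1) == y))
```

In this sketch a larger `gamma` keeps the generated samples closer to the original data, while a smaller value lets them drift further.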

Abstract

Overfitting commonly occurs when deep neural networks (DNNs) are applied to small-scale datasets, where DNNs do not generalize well from existing data to unseen data. The main cause of overfitting is that small-scale datasets cannot reflect real-world conditions. Label smoothing (LS) is an effective regularization method that prevents overfitting by mixing one-hot labels with uniform label vectors. However, LS focuses only on labels and ignores the distribution of the existing data. In this paper, we integrate distributionally robust optimization (DRO) into LS, which flexibly shifts the existing data distribution toward unseen domains when training DNNs. Specifically, we prove that, when DRO is integrated, the regularization of LS can be extended to a regularization term on the DNN parameters. This regularization term can be used to shift existing data to unseen domains and generate new data. Furthermore, we propose an approximate gradient-iteration label smoothing algorithm (GI-LS) to realize these findings and train DNNs. We prove that shifting the existing data does not affect the convergence of GI-LS. Since GI-LS involves several hyperparameters, we further use Bayesian optimization (BO) to find relatively optimal combinations of them. Taking small-scale anomaly classification tasks as a case study, we evaluate GI-LS, and the results demonstrate its superior performance.
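To make the regularization view of label smoothing concrete, the short check below verifies numerically that cross-entropy against smoothed labels equals a weighted sum of the usual one-hot cross-entropy and a cross-entropy against the uniform vector. The smoothing weight `alpha`, the example probabilities, and the function names are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def smooth_labels(one_hot, alpha):
    # Mix one-hot labels with the uniform label vector over K classes.
    num_classes = one_hot.shape[-1]
    return (1.0 - alpha) * one_hot + alpha / num_classes

def cross_entropy(probs, targets):
    # Mean cross-entropy between predicted probabilities and (soft) targets.
    return float(-np.mean(np.sum(targets * np.log(probs), axis=-1)))

alpha = 0.1
one_hot = np.eye(3)[[0, 2]]                      # two samples, three classes
uniform = np.full_like(one_hot, 1.0 / 3.0)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])              # hypothetical model outputs

ls_loss = cross_entropy(probs, smooth_labels(one_hot, alpha))
decomposed = ((1.0 - alpha) * cross_entropy(probs, one_hot)
              + alpha * cross_entropy(probs, uniform))
```

Because cross-entropy is linear in the targets, `ls_loss` and `decomposed` agree. The second term penalizes predictions that move far from uniform, which is the sense in which LS acts as a regularizer on the labels alone.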

Paper Structure (22 sections, 6 theorems, 78 equations, 8 figures, 8 tables, 1 algorithm)

Key Result

Theorem 1

For any distribution $Q$ and any $\rho > 0$, For any $\gamma \geq 0$, we have

Figures (8)

  • Figure 1: Some surface defects of magnetic tiles.
  • Figure 2: Some surface anomalies of Wood and Carpet.
  • Figure 3: Bayesian optimization for MT-Defect, MT, Wood and Carpet. The blue, yellow and red lines represent the results of ResNet18, ResNet34, and ResNet50, respectively.
  • Figure 4: The relationship between $\alpha$ and $T$ as BO iterates. From top to bottom, the rows correspond to ResNet18, ResNet34, and ResNet50. From left to right, the columns correspond to MT-Defect, MT, Wood, and Carpet. Black points mark the optimization result of each iteration. For each dataset, the left column (green) shows the top-1 accuracy at each BO iteration: the more concentrated the points, the higher the top-1 accuracy. The right column shows the standard error at each BO iteration: the more concentrated the points, the lower the standard error.
  • Figure 5: The relationship between $\alpha$ and $\eta$ as BO iterates. The settings are the same as those in Figure 4.
  • ...and 3 more figures

Theorems & Definitions (7)

  • Definition 1: Wasserstein distance (sinha2018certifiable; villani2021topics)
  • Theorem 1
  • Theorem 2: Relationship between Existing Data and Optimal Generated Data
  • Lemma 1 (bonnans2013perturbation)
  • Theorem 3: Bound of Difference between Existing Data and Optimal Generated Data
  • Theorem 4: Bound of Surrogate Robustness Loss
  • Theorem 5: Algorithm Convergence
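
For orientation, Definition 1 in the list above refers to the Wasserstein distance. Its statement is not reproduced on this page, so the following is the standard textbook form, and the paper's exact notation may differ. For a nonnegative transport cost $c$ and distributions $P$ and $Q$, with $\Pi(P, Q)$ denoting the set of couplings whose marginals are $P$ and $Q$:

$$
W_c(P, Q) = \inf_{M \in \Pi(P, Q)} \mathbb{E}_{(Z, Z') \sim M}\left[ c(Z, Z') \right]
$$

Under this definition, a Wasserstein ambiguity set of radius $\rho > 0$ around $Q$ is $\{ P : W_c(P, Q) \leq \rho \}$. This reading assumes that $\rho$ plays the same role as the $\rho$ appearing in Theorem 1.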