Towards Robust Out-of-Distribution Generalization: Data Augmentation and Neural Architecture Search Approaches

Haoyue Bai

TL;DR

This thesis proposes a novel and effective approach to disentangle the spurious correlation between features that are not essential for recognition, and discovers robust architectures that perform well for OoD generalization.

Abstract

Deep learning has achieved tremendous success in recent years. Despite this, its performance in practice often degrades drastically on out-of-distribution (OoD) data, i.e., when training and test data are sampled from different distributions. In this thesis, we study ways toward robust OoD generalization for deep learning, i.e., performance that is not susceptible to distribution shift in the test data. We first propose a novel and effective approach to disentangle the spurious correlation between features that are not essential for recognition. It learns a decomposed feature representation by orthogonalizing the two gradients of the losses for the category and context branches. Furthermore, we perform gradient-based augmentation on context-related features (e.g., styles, backgrounds, or scenes of target objects) to improve the robustness of the learned representations. Results show that our approach generalizes well under different distribution shifts. We then study the problem of strengthening neural architecture search in OoD scenarios. We propose to optimize the architecture parameters to minimize the validation loss on synthetic OoD data, under the condition that the corresponding network parameters minimize the training loss. Moreover, to obtain a proper validation set, we learn a conditional generator by maximizing the losses that different neural architectures compute on its generated data. Results show that our approach effectively discovers robust architectures that perform well for OoD generalization.
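The gradient orthogonalization and gradient-based augmentation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the thesis's implementation: the linear branch heads `W_cat` and `W_ctx`, the squared-cosine penalty, the normalized step of size `step`, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def ce_grad_wrt_feature(W, z, label):
    # Gradient of cross-entropy(softmax(W @ z), label) with respect to z,
    # for a hypothetical linear branch head W.
    probs = softmax(W @ z)
    probs[label] -= 1.0
    return W.T @ probs

def orthogonality_penalty(g_cat, g_ctx, eps=1e-12):
    # Squared cosine similarity between the two branch gradients
    # (an assumed form of the regularizer); zero when they are orthogonal.
    cos = g_cat @ g_ctx / (np.linalg.norm(g_cat) * np.linalg.norm(g_ctx) + eps)
    return cos ** 2

def augment_along_context(z, g_ctx, step=0.5, eps=1e-12):
    # Perturb the feature along the normalized context-loss gradient
    # (an assumed form of gradient-based feature augmentation).
    return z + step * g_ctx / (np.linalg.norm(g_ctx) + eps)

rng = np.random.default_rng(0)
z = rng.normal(size=8)            # feature vector from a backbone (random here)
W_cat = rng.normal(size=(3, 8))   # category head: 3 classes (hypothetical)
W_ctx = rng.normal(size=(4, 8))   # context head: 4 contexts (hypothetical)

g_cat = ce_grad_wrt_feature(W_cat, z, label=1)
g_ctx = ce_grad_wrt_feature(W_ctx, z, label=2)

penalty = orthogonality_penalty(g_cat, g_ctx)  # would be added to the training loss
z_aug = augment_along_context(z, g_ctx)        # augmented feature for the category head
```

When the two gradients are orthogonal, a small step along the context gradient leaves the category loss unchanged to first order, which is one way to read why the augmentation targets context-related features.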

Paper Structure

This paper contains 34 sections, 2 theorems, 23 equations, 15 figures, 13 tables, 2 algorithms.

Key Result

Theorem 1

For arbitrary $\sigma \geq 0$, $\mathbb{E}[\mathcal{L}^2]$ and $\mathbb{E}[\mathcal{L}^\textnormal{orth}]$ are both minimized only if $\hat{y}$ does not predict $y$ from $x_2$, i.e., $\alpha_{2,1} = 0$, even if there is spurious correlation between $X_2$ and $Y$.

Figures (15)

  • Figure 3.1: Illustration of the two-dimensional OoD shifts among datasets in different OoD research areas, including Colored MNIST, PACS, and NICO. Extensive experiments showed that many OoD methods can only deal with one dimension of OoD shift.
  • Figure 3.2: Typical examples of two-dimensional out-of-distribution data from Colored MNIST, PACS, and NICO. In the NICO dataset, contexts such as "on grass", "on snow" and "in water" result in mini-domains in the dataset, suggesting a diversity shift among the data. On the other hand, specific contexts such as "at home" are common for cats but unusual for dogs. The category branch and the context branch are correlated, indicating a correlation shift among the data.
  • Figure 3.3: An overview of the proposed DecAug. The input features $z$ extracted by the backbone are decomposed into category-related and context-related features with orthogonal regularization. Gradient-based augmentation is then performed in the feature space to obtain semantically augmented samples.
  • Figure 3.4: The gradient visualization of the decomposed category-related and context-related high-dimensional features. The first row shows the original input images, the second row shows the corresponding back propagation of the category branch, and the last row shows the back propagation of the context branch.
  • Figure 3.5: The t-SNE visualization of the decomposed high-dimensional category-related and context-related features. (a) Embedding of the category branch versus category labels. (b) Embedding of the context branch versus context labels. The difference between (a) and (b) shows that the high-level category-related and context-related features are well decomposed.
  • ...and 10 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Lemma 1
  • proof
  • proof