SETA: Semantic-Aware Token Augmentation for Domain Generalization

Jintao Guo, Lei Qi, Yinghuan Shi, Yang Gao

TL;DR

This paper proposes SEmantic-aware Token Augmentation (SETA), which transforms token features by perturbing local edge cues while preserving global shape features, thereby encouraging the model to learn shape information. A theoretical analysis shows that the method reduces the generalization risk bound.

Abstract

Domain generalization (DG) aims to enhance model robustness against domain shifts without accessing target domains. A prevalent category of DG methods is data augmentation, which generates virtual samples to simulate domain shifts. However, existing augmentation techniques in DG are mainly tailored for convolutional neural networks (CNNs), with limited exploration in token-based architectures, i.e., vision transformer (ViT) and multi-layer perceptron (MLP) models. In this paper, we study the impact of prior CNN-based augmentation methods on token-based models and find that their performance is suboptimal because they do not incentivize the model to learn holistic shape information. To tackle this issue, we propose the SEmantic-aware Token Augmentation (SETA) method. SETA transforms token features by perturbing local edge cues while preserving global shape features, thereby enhancing the model's learning of shape information. To further enhance the generalization ability of the model, we introduce two stylized variants of our method, each combined with a state-of-the-art style augmentation method in DG. We provide a theoretical insight into our method, demonstrating its effectiveness in reducing the generalization risk bound. Comprehensive experiments on five benchmarks show that our method achieves SOTA performance across various ViT and MLP architectures. Our code is available at https://github.com/lingeringlight/SETA.

Paper Structure (14 sections, 24 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Comparison of models on cross-domain performance, edge bias, shape bias, and style shift robustness. We compare our method with SOTA augmentation methods in DG (ALOFT [guo2023aloft] and DSU [li2022uncertainty]) on (a) original samples; (b) edge-pass samples reconstructed from phase spectra [bai2022improving]; (c) style-perturbed samples whose amplitude spectra are perturbed [fridovich2022spectral]; (d) patch-shuffled samples whose patch locations are shuffled within each sample [baker2018deep]. Experiments are conducted on PACS with the GFNet-H-Ti backbone [rao2021global].
  • Figure 2: An overview of the proposed SETA. SETA consists of three core modules: 1) Activation-based Edge Tokens Selection (ETS), which distinguishes and extracts tokens containing edge information; 2) Shape Tokens Shuffling (STS), which generates texture noise by shuffling tokens from another sample, disrupting holistic shape while keeping local edges; 3) a Token Mixing module, which superposes edge tokens from an object sample onto the shuffled tokens from another sample. The augmented sample is assigned the label of the sample contributing the edge tokens. We design two stylized variants of SETA that use SOTA DG augmentation methods, i.e., DSU [li2022uncertainty] and ALOFT [guo2023aloft], to create stylized token-shuffled samples that simulate potential domain shifts.
  • Figure 3: An intuitive illustration of SETA. The ETS module extracts edge-related tokens from the original sample, while the STS module generates shape-disrupted noise by shuffling the tokens of another sample randomly selected from the current batch. The Mixup version of SETA blends the values of the edge-related tokens and the shape-disrupted tokens, while the CutMix version of SETA replaces the edge-unrelated tokens with the shape-disrupted tokens.
  • Figure 4: Segmentation results on the unseen Cityscapes domain with the model trained on synthetic GTA5. We compare our method with the baseline and SOTA augmentation-based DG methods. The results indicate that our method can effectively improve the segmentation performance of the model.
  • Figure 5: Effects of hyper-parameters in SETA, including the inserted positions and the low-frequency mask scale $r$. The experiments are conducted on PACS with the GFNet-H-Ti backbone. L1 to L4 are four transformer layers of the network.
  • ...and 3 more figures
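
The three modules described in the captions of Figures 2 and 3 (edge token selection, token shuffling, token mixing) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation (see the linked repository for that). Several details are assumptions made for illustration: the edge score is taken to be the mean absolute feature value of a token, a fixed `keep_ratio` decides how many tokens count as edge tokens, and in the Mixup variant the non-edge positions simply take the shuffled tokens. All function and parameter names are hypothetical.

```python
import random


def token_activation(token):
    # Assumed edge score: mean absolute feature value of one token.
    return sum(abs(v) for v in token) / len(token)


def select_edge_tokens(tokens, keep_ratio):
    # ETS-like step: mark the highest-scoring tokens as edge tokens (1) and the rest as 0.
    n_keep = max(1, int(keep_ratio * len(tokens)))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: token_activation(tokens[i]),
                    reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(tokens))]


def shuffle_tokens(tokens, rng):
    # STS-like step: permute token positions, which keeps each token's local
    # content but destroys the global spatial layout (the holistic shape).
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


def seta_mix(tokens_a, label_a, tokens_b, keep_ratio=0.5,
             mode="cutmix", lam=0.5, rng=None):
    # Token mixing: superpose edge tokens of sample A onto shuffled tokens of sample B.
    # The augmented sample keeps the label of A, the sample contributing the edge tokens.
    if mode not in ("cutmix", "mixup"):
        raise ValueError("mode must be 'cutmix' or 'mixup'")
    rng = rng or random.Random()
    mask = select_edge_tokens(tokens_a, keep_ratio)
    noise = shuffle_tokens(tokens_b, rng)
    mixed = []
    for tok_a, tok_n, is_edge in zip(tokens_a, noise, mask):
        if not is_edge:
            # Non-edge positions are filled with shape-disrupted tokens.
            mixed.append(list(tok_n))
        elif mode == "cutmix":
            # CutMix-style: edge tokens are kept unchanged.
            mixed.append(list(tok_a))
        else:
            # Mixup-style (assumed form): blend edge tokens with the noise tokens.
            mixed.append([lam * a + (1.0 - lam) * n for a, n in zip(tok_a, tok_n)])
    return mixed, label_a


# Toy example: 4 tokens with 2 features each.
sample_a = [[5.0, 5.0], [0.1, 0.1], [4.0, -4.0], [0.2, 0.0]]
sample_b = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
augmented, label = seta_mix(sample_a, "dog", sample_b, rng=random.Random(0))
```

In this toy example the first and third tokens of `sample_a` have the largest scores, so they survive as edge tokens, while the other two positions are filled with tokens drawn from a random permutation of `sample_b`. In a real model the tokens would be intermediate feature tensors and the operation would run per batch inside the network.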