Self-Bootstrapping for Versatile Test-Time Adaptation

Shuaicheng Niu, Guohao Chen, Peilin Zhao, Tianyi Wang, Pengcheng Wu, Zhiqi Shen

TL;DR

Self-Bootstrapping for Versatile Test-Time Adaptation (SPA) introduces a task- and architecture-agnostic TTA framework that uses the original image as a strong target and a geometry-preserving deteriorated view as a weak input. It employs active weak-to-strong learning with a prediction-consistency objective and two Fourier-domain augmentations—low-frequency amplitude masking and high-frequency noise injection—grounded in a frequency-domain analysis of domain shifts. The method updates a small subset of parameters at test time and includes a confidence-based selection to avoid unreliable supervision. SPA demonstrates state-of-the-art or competitive improvements on image classification, 3D monocular detection, and segmentation, and functions as a practical plug-in module for existing TTA methods.

Abstract

In this paper, we seek to develop a versatile test-time adaptation (TTA) objective for a variety of tasks: classification and regression across image-, object-, and pixel-level predictions. We achieve this through a self-bootstrapping scheme that optimizes prediction consistency between the test image (as target) and its deteriorated view. The key challenge lies in devising effective augmentations/deteriorations that: i) preserve the image's geometric information, e.g., object sizes and locations, which is crucial for TTA on object/pixel-level tasks, and ii) provide sufficient learning signals for TTA. To this end, we analyze how common distribution shifts affect the image's information power across spatial frequencies in the Fourier domain, and reveal that low-frequency components carry high power, so masking them supplies more learning signals, whereas masking high-frequency components does not. In light of this, we randomly mask the low-frequency amplitude of an image in its Fourier domain for augmentation. Meanwhile, we also augment the image with noise injection, which enhances the information power at high frequencies and thereby compensates for the learning signals missing there. Experiments show that, either independently or as a plug-and-play module, our method achieves superior results across classification, segmentation, and 3D monocular detection tasks with both transformer and CNN models.
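The two augmentations described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the disc-shaped low-frequency region, and the parameters `mask_ratio`, `radius_frac`, and `gamma` are hypothetical choices made here, since this section does not specify the exact mask layout or noise scaling.

```python
import numpy as np


def mask_low_freq_amplitude(img, mask_ratio=0.2, radius_frac=0.25, rng=None):
    """Randomly zero amplitude coefficients inside a centered low-frequency disc.

    img: float array of shape (H, W) or (H, W, C).
    mask_ratio and radius_frac are illustrative knobs, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    # Shift the spectrum so the zero frequency sits at (h // 2, w // 2).
    spec = np.fft.fftshift(np.fft.fft2(img, axes=(0, 1)), axes=(0, 1))
    amp, phase = np.abs(spec), np.angle(spec)

    # Assumed low-frequency region: a disc around the spectrum center.
    yy, xx = np.mgrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    low = dist <= radius_frac * min(h, w) / 2
    drop = low & (rng.random((h, w)) < mask_ratio)
    if img.ndim == 3:
        drop = drop[..., None]  # share one mask across channels

    # Only the amplitude is edited; the phase is left untouched.
    spec = np.where(drop, 0.0, amp) * np.exp(1j * phase)
    out = np.fft.ifft2(np.fft.ifftshift(spec, axes=(0, 1)), axes=(0, 1))
    # A random mask is not conjugate-symmetric, so keep only the real part.
    return out.real


def inject_gaussian_noise(img, gamma=0.1, rng=None):
    """Add zero-mean Gaussian noise; gamma is an assumed noise scale."""
    rng = np.random.default_rng() if rng is None else rng
    return img + gamma * rng.standard_normal(img.shape)
```

Neither function resizes, crops, or warps the input, which is the sense in which such deteriorations leave object sizes and locations intact. In a training loop, the deteriorated view would be fed to the model and its prediction pulled toward the (gradient-stopped) prediction on the original image.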

Paper Structure

This paper contains 19 sections, 5 equations, 4 figures, 10 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of the SPA method. (a) We conduct self-bootstrapping learning for TTA by maximizing prediction consistency from the weak augmented/deteriorated views to the strong original image view. The augmentations are designed to preserve geometric structure by (b) randomly masking low-frequency components of the image’s amplitude in the Fourier domain, and (c) injecting Gaussian noise into the original image to enhance the information intensity at high frequencies. 'sg': stop gradient. (I)FFT: (Inverse) Fast Fourier Transform.
  • Figure 2: (a-d) Changes in radially averaged power spectral density (RAPSD) under domain shifts. (e) SPA’s geometry-preserving augmentations reduce the RAPSD at low frequencies and enhance it at high frequencies to create deteriorated images for our self-bootstrapping learning. We separately select 512 images from each of the Source, ImageNet-R, and ImageNet-C (15 corruptions) sets, apply the FFT, and visualize their mean RAPSD based on the spectrum amplitude.
  • Figure 3: Sensitivity of amplitude mask ratio $m$ in Eqn. (\ref{eq:fourier_low_mask}) and noise injection ratio $\gamma$ in Eqn. (\ref{eq:noise_injection}). We use ViT-Base for ImageNet-C (Gaussian Noise) and MonoFlex for KITTI-Fog. The source model Acc./AP on ImageNet-C/KITTI-Fog is 55.5%/4.5%.
  • Figure 4: Visualizations of partial images in ImageNet, ImageNet-C/A/R/Sketch, KITTI, KITTI-C, Cityscapes and ACDC.
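
The RAPSD curves referred to in Figure 2 can be approximated with a short NumPy routine. This is a sketch under assumptions rather than the authors' plotting code: the function name `rapsd`, the integer-radius binning, and the use of squared amplitude as "power" are choices made here for illustration (the caption states only that the curves are based on the spectrum amplitude).

```python
import numpy as np


def rapsd(img):
    """Radially averaged power spectral density of a 2-D grayscale image.

    Returns one value per integer radius 0 .. min(H, W) // 2 - 1, where
    radius 0 is the zero-frequency (DC) component.
    """
    h, w = img.shape
    # Centered power spectrum: squared magnitude of the shifted 2-D FFT.
    power = np.abs(np.fft.fftshift(np.fft.fft2(img))) ** 2

    # Integer radius (floor of the distance) of every coefficient from the center.
    yy, xx = np.mgrid[:h, :w]
    radius = np.hypot(yy - h // 2, xx - w // 2).astype(int)

    # Keep only full rings that fit inside the spectrum.
    max_r = min(h, w) // 2
    keep = radius < max_r
    sums = np.bincount(radius[keep], weights=power[keep], minlength=max_r)
    counts = np.bincount(radius[keep], minlength=max_r)
    return sums / counts
```

Averaging `rapsd(x)` over a batch of images from each dataset and plotting the result on log axes would give curves comparable in spirit to those in Figure 2; masking low-frequency amplitude should lower the left end of the curve, while adding Gaussian noise should raise the right end.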