RobustFormer: Noise-Robust Pre-training for images and videos

Ashish Bastola, Nishant Luitel, Hao Wang, Danda Pani Paudel, Roshani Poudel, Abolfazl Razi

TL;DR

RobustFormer tackles the vulnerability of vision transformers to noise by integrating Discrete Wavelet Transform (DWT) with masked autoencoder pre-training for both images and videos. It uses 3D-DWT downsampling and a DWT-aware attention mechanism that relies on low-frequency components, avoiding the costly IDWT step and reducing computation. Empirical results show substantial robustness gains on ImageNet-C and ImageNet-P, and notable improvements on UCF-101, while preserving performance on clean data and reducing FLOPs by up to 4.4%. The work is the first to deliver a full DWT-based MAE framework for video inputs, demonstrating that attention operating on multi-scale wavelet representations can yield practical robustness benefits with efficient computation.

Abstract

While deep learning models such as transformers have revolutionized time-series and vision tasks, they remain highly susceptible to noise and often overfit to noisy patterns rather than robust features. This issue is exacerbated in vision transformers, which rely on pixel-level details that can easily be corrupted. To address this, we leverage the discrete wavelet transform (DWT) for its ability to decompose inputs into multi-resolution layers, isolating noise primarily in the high-frequency domain while preserving essential low-frequency information for resilient feature learning. Conventional DWT-based methods, however, struggle with computational inefficiencies due to the requirement for a subsequent inverse discrete wavelet transform (IDWT) step. In this work, we introduce RobustFormer, a novel framework that enables noise-robust masked autoencoder (MAE) pre-training for both images and videos by using DWT for efficient downsampling, eliminating the need for expensive IDWT reconstruction and simplifying the attention mechanism to focus on noise-resilient multi-scale representations. To our knowledge, RobustFormer is the first DWT-based method fully compatible with video inputs and MAE-style pre-training. Extensive experiments on noisy image and video datasets demonstrate that our approach achieves up to an 8% increase in Top-1 classification accuracy under severe noise conditions on ImageNet-C and up to 2.7% on ImageNet-P compared to the baseline, and up to 13% higher Top-1 accuracy on UCF-101 under severe custom noise perturbations, while maintaining similar accuracy on clean datasets. We also observe a reduction in computational complexity of up to 4.4% through IDWT removal compared to the VideoMAE baseline, without any performance drop.
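The idea of using the DWT as a downsampling step that keeps only low-frequency content can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single-level Haar wavelet, a single-channel clip of shape (T, H, W), and a hypothetical helper name `haar_lowpass_3d`. It only shows that the low-frequency subband is already a half-resolution tensor, so no IDWT is needed to obtain a downsampled representation.

```python
import numpy as np


def haar_lowpass_3d(video):
    """Low-frequency (LLL) subband of a single-level 3D Haar DWT.

    Hypothetical helper for illustration only. `video` has shape
    (T, H, W) with even sizes; the result has shape (T/2, H/2, W/2).
    """
    x = np.asarray(video, dtype=np.float64)
    if x.ndim != 3 or any(size % 2 for size in x.shape):
        raise ValueError("expected a (T, H, W) array with even dimensions")
    for axis in range(3):
        # Haar low-pass filter [1/sqrt(2), 1/sqrt(2)] followed by
        # downsampling by 2 along the current axis.
        even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
        odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
        x = (even + odd) / np.sqrt(2.0)
    return x


# Usage: a constant clip corrupted by zero-mean Gaussian noise.
rng = np.random.default_rng(0)
clean = np.ones((8, 32, 32))
noisy = clean + 0.5 * rng.standard_normal(clean.shape)
low = haar_lowpass_3d(noisy)
print(low.shape)  # (4, 16, 16)
```

Under these assumptions the transform is orthonormal, so i.i.d. noise keeps its standard deviation in the LLL subband while a locally constant signal is scaled by 2√2. That is one way to see why the low-frequency band has a better signal-to-noise ratio than the raw pixels.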

Paper Structure

This paper contains 14 sections, 7 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Accuracy vs. relative robustness (performance on corrupted vs. clean data) of action recognition models on UCF-101.
  • Figure 2: Comparison between different architectures designed for video tasks. (a) is the regular masked autoencoder (VideoMAE; Tong et al., 2022), (b) is the DWT-based architecture with IDWT module, and (c) is our proposed method.
  • Figure 3: The framework of RobustFormer. Our approach integrates spatio-temporal tube masking as well as multi-resolution feature transformation using DWT to handle real-world noise types.
  • Figure 4: Comparison of Top-1 and Top-5 accuracy for ImageNet-C and ImageNet-P. Accuracy for ImageNet-C is averaged across severity levels.
  • Figure 5: Comparison of Top-1 and Top-5 Accuracy for different severity levels across rain and packet loss noise for UCF-101 dataset
  • ...and 1 more figure