Exploring the Coordination of Frequency and Attention in Masked Image Modeling

Jie Gui, Tuo Chen, Minjing Dong, Zhengqi Liu, Hao Luo, James Tin-Yau Kwok, Yuan Yan Tang

TL;DR

Masked image modeling (MIM) often suffers from high pretraining cost and suboptimal semantic focus due to random masking. The paper introduces Frequency & Attention-driven Masking and Throwing (FAMT), a plug-and-play module that combines Vision Transformer self-attention with FFT-based low-pass guidance to compute robust token importance, and a patch-throwing strategy to reduce computation. By updating token weights periodically and discarding less informative regions, FAMT speeds pretraining by up to about $50\%$ and improves MAE linear probing accuracy by roughly $1.3\% \sim 3.9\%$ across CIFAR-10/100, Tiny ImageNet, and ImageNet-1K, with additional gains in detection and segmentation on ADE20K, COCO, LVIS, and iSAID. The approach demonstrates strong transferability and generality across MIM backbones and downstream tasks, supporting its adoption as a versatile enhancement for self-supervised vision representation learning.

Abstract

Recently, masked image modeling (MIM), which learns visual representations by reconstructing the masked patches of an image, has dominated self-supervised learning in computer vision. However, the pre-training of MIM always takes massive time due to the large-scale data and large-size backbones. We mainly attribute it to the random patch masking in previous MIM works, which fails to leverage the crucial semantic information for effective visual representation learning. To tackle this issue, we propose the Frequency & Attention-driven Masking and Throwing Strategy (FAMT), which can extract semantic patches and reduce the number of training patches to boost model performance and training efficiency simultaneously. Specifically, FAMT utilizes the self-attention mechanism to extract semantic information from the image for masking during training in an unsupervised manner. However, attention alone could sometimes focus on inappropriate areas regarding the semantic information. Thus, we are motivated to incorporate the information from the frequency domain into the self-attention mechanism to derive the sampling weights for masking, which captures semantic patches for visual representation learning. Furthermore, we introduce a patch throwing strategy based on the derived sampling weights to reduce the training cost. FAMT can be seamlessly integrated as a plug-and-play module and surpasses previous works, e.g., reducing the training phase time by nearly $50\%$ and improving the linear probing accuracy of MAE by $1.3\% \sim 3.9\%$ across various datasets, including CIFAR-10/100, Tiny ImageNet, and ImageNet-1K. FAMT also demonstrates superior performance in downstream detection and segmentation tasks.
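
To make the masking-and-throwing idea in the abstract concrete, here is a minimal NumPy sketch of how per-patch sampling weights could drive both decisions. It is not the authors' implementation. The function names (`combine_scores`, `split_patches`), the mixing coefficient `alpha`, the ratios, and the specific rule (throw the lowest-weight patches, then sample masked patches in proportion to weight) are all assumptions chosen for illustration.

```python
import numpy as np

def combine_scores(attn, freq, alpha=0.5):
    """Mix two per-patch score vectors into sampling weights.

    ASSUMPTION: a convex combination of normalized scores. The paper's
    actual fusion rule is not given in this summary.
    """
    attn = attn / attn.sum()
    freq = freq / freq.sum()
    return alpha * attn + (1.0 - alpha) * freq

def split_patches(weights, mask_ratio, throw_ratio, rng):
    """Partition patch indices into (visible, masked, thrown).

    ASSUMPTION: the lowest-weight patches are thrown (never fed to the
    model); masked patches are then drawn without replacement from the
    remainder with probability proportional to their weight.
    """
    n = weights.shape[0]
    n_throw = int(round(n * throw_ratio))
    n_mask = int(round(n * mask_ratio))
    if n_throw + n_mask > n:
        raise ValueError("mask_ratio + throw_ratio must not exceed 1")

    order = np.argsort(weights, kind="stable")  # ascending by weight
    thrown = order[:n_throw]
    kept = order[n_throw:]

    # Small epsilon keeps every kept patch sampleable.
    p = weights[kept] + 1e-8
    p = p / p.sum()
    masked = rng.choice(kept, size=n_mask, replace=False, p=p)
    visible = np.setdiff1d(kept, masked)
    return visible, masked, thrown

rng = np.random.default_rng(0)
attn = rng.random(196)   # stand-in for [CLS] attention over 14x14 patches
freq = rng.random(196)   # stand-in for frequency-derived patch scores
weights = combine_scores(attn, freq)
visible, masked, thrown = split_patches(weights, 0.5, 0.25, rng)
print(len(visible), len(masked), len(thrown))
```

Only the visible patches would enter the encoder in this sketch, and thrown patches are excluded from both encoding and reconstruction, which is where the reduction in training cost would come from.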

Paper Structure (37 sections, 8 equations, 8 figures, 10 tables, 1 algorithm)

Figures (8)

  • Figure 1: Visualization of FAMT. (a) is the original image. (b)-(d) visualize the self-attention of the [CLS] token on the heads of the last layer, following DINO, for different masking and throwing schemes based on MAE: (b) random masking, (c) frequency & attention-driven masking, and (d) frequency & attention-driven masking and throwing.
  • Figure 2: Visualization of attention. Each subfigure shows, from left to right and top to bottom, the original image and the attention maps from the last layer of the MAE encoder at different training stages (40th, 60th, 80th, 100th).
  • Figure 3: Overview of common MIM methods and FAMT. The top of the figure shows a simplified view of common MIM methods, and the bottom shows a simplified overview of FAMT. The gray patches are masked patches. The black patches denote thrown tokens, which are not input into the model and therefore cost no computational resources. Compared to the original methods, FAMT leverages frequency information and attention to mask and throw patches deliberately.
  • Figure 4: Visualization of the attention map of the last layer in the encoder after 400 epochs of pre-training. From left to right: the original image, followed by the attention maps from the last layer of the MAE encoder using random masking, attention-driven masking, and FAMT, respectively.
  • Figure 5: The pipeline of FAMT for updating $P_A$. The filter is a Gaussian low-pass filter.
  • ...and 3 more figures
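
The Gaussian low-pass filter mentioned in the Figure 5 caption can be sketched in the Fourier domain as follows. This is only an illustration, assuming a single-channel image. The names `gaussian_low_pass` and `patch_scores`, the cutoff `sigma` (in cycles per pixel), and the choice to pool the filtered magnitude into one score per patch are hypothetical and not taken from the paper.

```python
import numpy as np

def gaussian_low_pass(image, sigma):
    """Low-pass filter a 2-D image with a Gaussian mask applied via the FFT.

    sigma is the filter width in cycles per pixel (ASSUMED parameterization).
    """
    h, w = image.shape
    fy = np.fft.fftfreq(h)[:, None]   # vertical frequencies
    fx = np.fft.fftfreq(w)[None, :]   # horizontal frequencies
    mask = np.exp(-(fx ** 2 + fy ** 2) / (2.0 * sigma ** 2))
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * mask))

def patch_scores(filtered, patch):
    """Average the filtered magnitude over non-overlapping patches.

    ASSUMPTION: mean absolute value per patch, normalized to sum to 1.
    """
    h, w = filtered.shape
    gh, gw = h // patch, w // patch
    blocks = np.abs(filtered[: gh * patch, : gw * patch])
    blocks = blocks.reshape(gh, patch, gw, patch)
    scores = blocks.mean(axis=(1, 3)).ravel() + 1e-8
    return scores / scores.sum()

rng = np.random.default_rng(0)
image = rng.random((224, 224))
low = gaussian_low_pass(image, sigma=0.05)
scores = patch_scores(low, patch=16)
print(scores.shape)
```

Because the mask equals 1 at zero frequency and decays with distance from it, constant regions pass through unchanged while fine texture and noise are suppressed.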