Frequency-Guided Masking for Enhanced Vision Self-Supervised Learning

Amin Karimi Monsefi, Mengxi Zhou, Nastaran Karimi Monsefi, Ser-Nam Lim, Wei-Lun Chao, Rajiv Ramnath

TL;DR

FOLK introduces adaptive frequency-domain masking via Com and RCom filters and a dual-view teacher–student framework to enhance vision self-supervised learning. By exposing the model to both frequency-masked and original images through self-distillation, FOLK mitigates the need for extensive downstream fine-tuning and improves data efficiency, achieving competitive results on ImageNet-1K and ADE20K with fewer pre-training epochs. The method combines reconstructive learning of masked frequencies with distillation of the teacher’s representations, yielding robust representations for diverse tasks including few-shot learning. Empirical results demonstrate strong performance gains over MFM and other SSL baselines across image classification, few-shot learning, and semantic segmentation, while maintaining reasonable compute and memory demands.

Abstract

We present a novel frequency-based Self-Supervised Learning (SSL) approach that significantly improves pre-training efficacy. Prior work in this direction masks out pre-defined frequencies in the input image and employs a reconstruction loss to pre-train the model. While achieving promising results, such an implementation has two fundamental limitations, as identified in our paper. First, using pre-defined frequencies overlooks the variability of image frequency responses. Second, because it is pre-trained on frequency-filtered images, the resulting model needs relatively more data to adapt to natural-looking images during fine-tuning. To address these drawbacks, we propose FOurier transform compression with seLf-Knowledge distillation (FOLK), integrating two dedicated ideas. First, inspired by image compression, we adaptively select the masked-out frequencies based on image frequency responses, creating more suitable SSL tasks for pre-training. Second, we employ a two-branch framework empowered by knowledge distillation, enabling the model to take both the filtered and original images as input, largely reducing the burden of downstream tasks. Our experimental results demonstrate that FOLK achieves performance competitive with many state-of-the-art SSL methods across various downstream tasks, including image classification, few-shot learning, and semantic segmentation.

Paper Structure

This paper contains 44 sections, 7 equations, 5 figures, 15 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Detailed visualizations of our proposed masking approach compared with the low/high-pass filters used in the MFM method (Xie et al., 2022). Panels b, d, f, h show the retained frequencies after applying the corresponding filter (zoomed-in views of the shifted spectrum for easier comparison). Panels c, e, g, i show the image restored from the retained frequencies. More examples can be found in the appendix. All images used in this paper are from ImageNet (Deng et al., 2009).
  • Figure 2: The frequency-based filtering process with our proposed informed filters, $Com$ and $RCom$. A portion of the frequencies, those with the highest magnitude, is selected to create the masks. The process randomly returns the image produced by either the $Com$ or the $RCom$ filter.
  • Figure 3: The proposed FOLK framework. Two views ($u$ and $v$) of the input image are processed through the informed filtering process introduced in Figure 2. This yields two frequency-masked views ($\tilde{u}$ and $\tilde{v}$), which serve as the input to the student model. From the masked view $\tilde{u}$ (or $\tilde{v}$), the student model is tasked with reconstructing the missing frequencies and with predicting the teacher model's feature representation of the other original view $v$ (or $u$). The student model $g_{\theta}$ and the teacher model $g_{\phi}$, with their corresponding heads $h_{\theta}$ and $h_{\phi}$, have the same architecture but different parameters, except that the student has an additional MFM head $\hat{h}_{\theta}$. Only the student model (with both of its heads) is updated through back-propagation, while the teacher parameters are periodically updated with an exponential moving average (EMA) of the corresponding student parameters.
  • Figure 4: Visual comparison between $Com$/$RCom$ filters employed by FOLK and other random masking techniques.
  • Figure 5: The loss progression of MFM using the original constant filters (denoted as "MFM") and MFM using the proposed $Com$ and $RCom$ filters (denoted as "MFM + Com/Rcom") for the first 100 epochs of pre-training.
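
The filtering process in Figure 2 and the EMA teacher update in Figure 3 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that $Com$ retains the highest-magnitude frequencies and $RCom$ retains the complement, that the two filters are chosen with equal probability, and that the input is a single-channel image. The function names, the `ratio` parameter, and the `momentum` value are hypothetical and chosen only for illustration.

```python
import numpy as np


def top_magnitude_mask(image, ratio=0.1):
    """Boolean mask over the 2D spectrum marking the highest-magnitude frequencies.

    `ratio` (hypothetical parameter) is the fraction of frequencies selected.
    """
    magnitude = np.abs(np.fft.fft2(image))
    k = max(1, int(round(ratio * magnitude.size)))
    # The k-th largest magnitude serves as the selection threshold.
    threshold = np.partition(magnitude.ravel(), -k)[-k]
    return magnitude >= threshold


def apply_mask(image, mask):
    """Zero out the frequencies where `mask` is False and return the restored image."""
    spectrum = np.fft.fft2(image)
    return np.fft.ifft2(spectrum * mask).real


def informed_filter(image, ratio=0.1, rng=None):
    """Randomly return the Com or RCom result (assumed semantics, see lead-in)."""
    rng = np.random.default_rng() if rng is None else rng
    top = top_magnitude_mask(image, ratio)
    if rng.random() < 0.5:
        return apply_mask(image, top), "Com"    # keep the dominant frequencies
    return apply_mask(image, ~top), "RCom"      # keep everything else


def ema_update(teacher, student, momentum=0.996):
    """EMA update of teacher parameters from student parameters (dicts of arrays)."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }
```

Because the two masks are complementary and the Fourier transform is linear, the $Com$ and $RCom$ outputs of the same image sum back to the original image in this sketch, which is a convenient sanity check when experimenting with the selection ratio.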