Improving Generative Pre-Training: An In-depth Study of Masked Image Modeling and Denoising Models

Hyesong Choi, Daeun Kim, Sungmin Cha, Kwang Moo Yi, Dongbo Min

TL;DR

This study investigates why integrating denoising with masked image modeling often fails to improve recognition performance and identifies three guiding principles for effective pre-training: apply corruption and restoration within the encoder, inject noise at the feature level (preferably in the lower encoder layers to capture high-frequency details), and disentangle the denoising and masking objectives. Building on these insights, the authors propose an encoder-style pre-training framework with feature-space noise and a disruption loss to suppress cross-talk between noisy and masked tokens. The method consistently surpasses masked image modeling and recent diffusion-based pre-training across a broad suite of tasks, including fine-grained categorization, semantic segmentation, and object detection, demonstrating improved transferability and better capture of high-frequency information. Overall, the work reframes generative pre-training for self-supervised visual representation learning and offers practical guidelines for leveraging noise to boost recognition performance.

Abstract

In this work, we dive deep into the impact of additive noise in pre-training deep networks. While various methods have attempted to use additive noise inspired by the success of latent denoising diffusion models, when used in combination with masked image modeling, their gains have been marginal when it comes to recognition tasks. We thus investigate why this would be the case, in an attempt to find effective ways to combine the two ideas. Specifically, we find three critical conditions: corruption and restoration must be applied within the encoder, noise must be introduced in the feature space, and an explicit disentanglement between noised and masked tokens is necessary. By implementing these findings, we demonstrate improved pre-training performance for a wide range of recognition tasks, including those that require fine-grained, high-frequency information to solve.
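The second and third conditions above can be illustrated with a small sketch. It is not the authors' implementation. It assumes that masked and noised tokens are drawn as disjoint sets, that the input is a list of token features taken from a lower encoder layer, and that a zero vector stands in for a learnable mask embedding. The function name `corrupt_tokens` and the values of `mask_ratio`, `noise_ratio`, and `sigma` are hypothetical. The loss used to disentangle the two objectives is not specified in this summary, so it is left out.

```python
import random


def corrupt_tokens(features, mask_ratio=0.5, noise_ratio=0.25, sigma=0.1, seed=0):
    """Mask some tokens and add Gaussian noise to a disjoint set of others.

    features: list of token feature vectors (lists of floats), imagined here
    as the output of a lower encoder layer. All ratios and sigma are
    illustrative assumptions, not values taken from the paper.
    """
    rng = random.Random(seed)
    n = len(features)
    order = list(range(n))
    rng.shuffle(order)

    # Split the shuffled indices so the two corruption types never overlap.
    n_mask = int(n * mask_ratio)
    n_noise = int(n * noise_ratio)
    masked = set(order[:n_mask])
    noised = set(order[n_mask:n_mask + n_noise])

    dim = len(features[0])
    mask_token = [0.0] * dim  # stand-in for a learnable mask embedding

    corrupted = []
    for i, token in enumerate(features):
        if i in masked:
            corrupted.append(list(mask_token))
        elif i in noised:
            # Additive Gaussian noise in feature space, not pixel space.
            corrupted.append([x + rng.gauss(0.0, sigma) for x in token])
        else:
            corrupted.append(list(token))
    return corrupted, masked, noised
```

In a full pre-training loop the encoder's remaining layers would then be asked to restore both kinds of corrupted tokens, with separate objectives for the masked and the noised sets.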

Paper Structure

This paper contains 28 sections, 9 equations, 12 figures.

Figures (12)

  • Figure 1: We find that noise-based pre-training, when applied in the right way, can enhance transfer learning ability. Leveraging these insights, we introduce a novel pre-training setup combining masking and noising, outperforming MIM baselines [he2022masked, xie2022simmim] and recent noise-based generative approaches [wei2023diffusion, zheng2023fast] across a wide range of recognition tasks, including fine-grained recognition.
  • Figure 2: We display the KL divergence among attention distributions across different heads (indicated by small dots) and the mean KL divergence (represented by large dots) in each layer for (a) a recent generative model [wei2023diffusion], (b) a representative masked image model [xie2022simmim], and (c) our method. This assesses whether various attention heads capture diverse frequency information, where higher KL divergence indicates broader frequency capture. Our method demonstrates a greater capacity for capturing diverse frequency information than MIM and generative approaches, which explains why it performs well across a wide range of recognition tasks, including fine-grained tasks.
  • Figure 3: We visualized the self-attention maps for the image classification token in the final layer of our model on a fine-grained visual categorization benchmark. The proposed method captures a range of frequencies by focusing effectively on both key features and fine details within complex scenes.
  • Figure 4: Fine-grained visual categorization (FGVC) is a critical benchmark for evaluating recognition models and requires detailed, localized feature learning. However, MIM approaches [he2022masked, xie2022simmim] show limitations on FGVC tasks: the radar graph reveals substantial room for improvement, both relative to the ideal boundary and relative to their performance on the standard recognition task.
  • Figure 5: Our evaluations reveal that recent generative pre-training approaches [wei2023diffusion, zheng2023fast] yield limited gains over MIM baselines [xie2022simmim, he2022masked] on recognition tasks, suggesting that simply adding denoising to MIM pre-training does not inherently elevate the representation quality essential for precise recognition. Except for DiffMAE [wei2023diffusion], we rely on the official implementations; for DiffMAE, we carefully reimplemented the method based on the manuscript, as no code is available. For reproducibility, all implemented code has been included in the Supplementary Material.
  • ...and 7 more figures
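
The head-diversity measure described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code. It assumes that each attention head in a layer is summarized by a single probability distribution over keys and that the layer-level score is the mean KL divergence over all ordered pairs of distinct heads. The function names and the smoothing constant `eps` are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of floats.

    eps is a small smoothing constant (an assumption of this sketch) that
    avoids division by zero when q has zero-probability entries.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_pairwise_kl(head_attn):
    """Mean KL divergence over all ordered pairs of distinct heads.

    head_attn: list with one attention distribution per head, each a list
    of probabilities over the same set of keys.
    """
    total = 0.0
    count = 0
    for i, p in enumerate(head_attn):
        for j, q in enumerate(head_attn):
            if i != j:
                total += kl_divergence(p, q)
                count += 1
    return total / count


# Heads that attend identically score zero; heads that attend to
# different keys score higher.
identical = [[0.5, 0.5], [0.5, 0.5]]
diverse = [[0.9, 0.1], [0.1, 0.9]]
print(mean_pairwise_kl(identical), mean_pairwise_kl(diverse))
```

Under this reading, a higher score for a layer means its heads distribute attention more differently from one another, which the caption interprets as broader frequency capture.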