Improving Generative Pre-Training: An In-depth Study of Masked Image Modeling and Denoising Models
Hyesong Choi, Daeun Kim, Sungmin Cha, Kwang Moo Yi, Dongbo Min
TL;DR
This study investigates why integrating denoising with masked image modeling often fails to improve recognition performance and identifies three guiding principles for effective pre-training: apply corruption and restoration within the encoder, inject noise at the feature level (preferably in the lower encoder layers to capture high-frequency details), and disentangle the denoising and masking objectives. Building on these insights, the authors propose an encoder-style pre-training framework with feature-space noise and a disruption loss to suppress cross-talk between noisy and masked tokens. The method consistently surpasses masked image modeling and recent diffusion-based pre-training across a broad suite of tasks, including fine-grained categorization, semantic segmentation, and object detection, demonstrating improved transferability and better capture of high-frequency information. Overall, the work reframes generative pre-training for self-supervised visual representation learning and offers practical guidelines for leveraging noise to boost recognition performance.
Abstract
In this work, we study in depth the impact of additive noise in pre-training deep networks. Various methods have attempted to use additive noise, inspired by the success of latent denoising diffusion models, but when combined with masked image modeling their gains on recognition tasks have been marginal. We therefore investigate why this is the case, with the aim of finding effective ways to combine the two ideas. Specifically, we identify three critical conditions: corruption and restoration must be applied within the encoder, noise must be introduced in the feature space, and noised and masked tokens must be explicitly disentangled. By implementing these findings, we demonstrate improved pre-training performance on a wide range of recognition tasks, including those that require fine-grained, high-frequency information.
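As a rough illustration of two of these conditions (noise added to features rather than pixels, and noised tokens kept separate from masked tokens), the following minimal sketch corrupts a list of token feature vectors. It is not the authors' implementation: the function name `corrupt_tokens`, the ratios, the noise scale `sigma`, and the use of a zero vector in place of a learnable mask token are all assumptions made for illustration. The restoration objective and the disruption loss are not specified in this text and are not shown.

```python
import random


def corrupt_tokens(features, mask_ratio=0.5, noise_ratio=0.25, sigma=0.1, seed=0):
    """Corrupt token features with disjoint masked and noised subsets.

    features: list of per-token feature vectors (lists of floats), standing in
    for intermediate encoder features. All parameters are hypothetical.
    Returns (corrupted, masked_idx, noised_idx).
    """
    rng = random.Random(seed)
    n = len(features)

    # Shuffle token indices once, then slice, so the two subsets never overlap.
    order = list(range(n))
    rng.shuffle(order)
    n_mask = int(n * mask_ratio)
    n_noise = int(n * noise_ratio)
    masked_idx = sorted(order[:n_mask])
    noised_idx = sorted(order[n_mask:n_mask + n_noise])
    masked_set, noised_set = set(masked_idx), set(noised_idx)

    corrupted = []
    for i, vec in enumerate(features):
        if i in masked_set:
            # Assumption: a zero vector stands in for a learnable mask token.
            corrupted.append([0.0] * len(vec))
        elif i in noised_set:
            # Additive Gaussian noise applied in feature space.
            corrupted.append([x + rng.gauss(0.0, sigma) for x in vec])
        else:
            # Remaining tokens pass through unchanged.
            corrupted.append(list(vec))
    return corrupted, masked_idx, noised_idx


# Toy input: 8 tokens with 4-dimensional features.
features = [[float(i + 1)] * 4 for i in range(8)]
corrupted, masked_idx, noised_idx = corrupt_tokens(features)
```

With the default ratios and 8 tokens, 4 tokens are masked, 2 receive noise, and 2 are left clean. In an actual encoder the same partitioning would be applied to a batch of feature tensors at a chosen layer, rather than to Python lists.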
