Learning with Unmasked Tokens Drives Stronger Vision Learners

Taekyung Kim; Sanghyuk Chun; Byeongho Heo; Dongyoon Han

TL;DR

Masked image modeling (MIM) methods like MAE often train encoders to focus on local pixel details, which limits broad-context understanding. The authors propose Learning with Unmasked Tokens (LUT), which adds a broader contextualization loss that guides unmasked tokens using a momentum-encoded global context, improving the modeling of long-range dependencies without sacrificing the reconstruction objective. LUT improves ImageNet-1K top-1 accuracy across ViT-S/16, ViT-B/16, and ViT-L/16, delivers strong ADE20K segmentation results, and transfers robustly to iNaturalist and FGVC tasks, while pre-training faster than MAE. Analyses using Grad-CAM, attention distance, and spectral metrics corroborate that LUT learns more discriminative, broader-context representations, suggesting broad applicability to downstream vision tasks.

Abstract

Masked image modeling (MIM) has become a leading self-supervised learning strategy. MIM methods such as the Masked Autoencoder (MAE) learn strong representations by randomly masking input tokens before the encoder processes them, with the decoder reconstructing the masked tokens of the input. However, MIM pre-trained encoders often exhibit a limited attention span, attributed to MIM's sole focus on regressing masked tokens, which may impede the encoder's learning of broader context. To address this limitation, we improve MIM by explicitly incorporating unmasked tokens into the training process. Specifically, our method lets the encoder learn from broader context supervision, so that unmasked tokens experience broader contexts while the decoder reconstructs masked tokens. The encoded unmasked tokens are thus equipped with extensive contextual information, and masked tokens can leverage these enhanced unmasked tokens for MIM. As a result, our simple remedy trains more discriminative representations, as shown by 84.2% top-1 accuracy with ViT-B on ImageNet-1K, a gain of 0.6 percentage points. We attribute the success to the enhanced pre-training method, as evidenced by the singular value spectrum and attention analyses. Finally, our models achieve significant performance gains on downstream semantic segmentation and fine-grained visual classification tasks, and on diverse robustness evaluation metrics. Code is available at https://github.com/naver-ai/lut
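
The training recipe summarized above can be illustrated with a minimal, pure-Python sketch. It is not the authors' implementation: the function names, the momentum value `tau`, and the loss weight are hypothetical, and the simple additive combination of the two losses is an assumption made for illustration. The 0.75 mask ratio in the usage note is the default commonly used with MAE. The sketch shows three pieces: splitting token indices into unmasked and masked sets, an exponential-moving-average update for a momentum (context) encoder, and a combined objective.

```python
import random


def random_mask(num_tokens, mask_ratio, rng):
    """Randomly split token indices into (unmasked, masked) lists."""
    num_masked = int(num_tokens * mask_ratio)
    perm = list(range(num_tokens))
    rng.shuffle(perm)
    masked = sorted(perm[:num_masked])
    unmasked = sorted(perm[num_masked:])
    return unmasked, masked


def ema_update(momentum_params, online_params, tau):
    """Move momentum-encoder weights toward the online-encoder weights.

    tau close to 1.0 means the momentum encoder changes slowly
    (the value of tau is an assumption, not taken from the paper).
    """
    return [tau * m + (1.0 - tau) * o
            for m, o in zip(momentum_params, online_params)]


def combined_loss(recon_loss, context_loss, weight=1.0):
    """Reconstruction loss on masked tokens plus a context loss on
    unmasked tokens. The additive form and `weight` are hypothetical."""
    return recon_loss + weight * context_loss
```

For a 224x224 image with 16x16 patches there are 196 tokens, so `random_mask(196, 0.75, random.Random(0))` leaves 49 unmasked tokens for the online encoder and 147 masked tokens for the decoder to reconstruct. In this sketch the momentum encoder would see the complete view and supply targets for the 49 unmasked tokens.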

Paper Structure (37 sections, 4 equations, 9 figures, 10 tables)

Introduction
Preliminary
MIM and Beyond
General formulation.
MIM formulation itself falls short in learning broader contexts.
Motivation - attention map visualizations
Method
Our simple solution.
Contextualized supervision.
Sparse unmasked tokens that learn broad contexts.
On contextual discrepancies across views.
Objective function.
Related Work
Experiment
ImageNet-1K Classification
...and 22 more sections

Figures (9)

Figure 1: Motivation - MAE may lack comprehensive region-wide attention. We observe how attention in MAE differs across the given queries. (a) The first column shows the example images with different queries at randomly picked patch indices. (b), (c) Each set of three columns shows the maps most attended by different heads. The turtle images have one foreground query and two background queries (upper and lower); the bird images have two foreground queries (upper two rows) and one background query. MAE shows localized attention maps but fails to provide comprehensive coverage of either the foreground or the background.
Figure 2: Framework overview. Our method performs masked image modeling with masked tokens, complemented by a context encoder that directs the learnable side by giving an augmented complete view for the sparse unmasked tokens. We employ distinct, simple MLP heads on top of each encoder to match representations and avoid optimization collapse. Under our macro concept, we opt for a simple choice in which the additional context momentum encoder mirrors the online encoder, although other options could be used instead. The flamingo image is borrowed from the n02007558 class in ImageNet-1K.
Figure 3: Input
Figure 4: MAE
Figure 5: Ours
...and 4 more figures
