Emerging Property of Masked Token for Effective Pre-training

Hyesong Choi; Hunsang Lee; Seyoung Joung; Hyejin Park; Jiyeong Kim; Dongbo Min

TL;DR

This work identifies prolonged pre-training as a key inefficiency in Masked Image Modeling and proposes Masked Token Optimization (MTO), a plug‑and‑play framework that explicitly optimizes masked tokens to enforce data singularity and control their interaction with visible tokens. By introducing a heterogeneity measure based on entropy and three targeted losses ($\mathcal{L}_{spa}$, $\mathcal{L}_{e}$, $\mathcal{L}_{r}$) coupled with weight recalibration, the authors demonstrate accelerated convergence across multiple MIM baselines (SimMIM, MAE, BootMAE, ConMIM) and backbone sizes (ViT‑B/L). Empirical results use the RAUC metric to quantify faster pre-training, showing reductions of roughly 50% in required epochs to reach baseline performance, along with consistent performance gains. The proposed approach is versatile, improving efficiency without altering core MIM objectives, and highlights data singularity as a central design principle for masked tokens, with a caveat that these properties remain dynamic and open to refinement.

Abstract

Driven by the success of Masked Language Modeling (MLM), self-supervised learning for computer vision has been invigorated by Masked Image Modeling (MIM), which plays a central role in recent breakthroughs. Notwithstanding the achievements of MIM across various downstream tasks, its overall efficiency is occasionally hampered by the lengthy duration of the pre-training phase. This paper presents the perspective that optimizing masked tokens is a means of addressing this issue. We first explore the inherent properties that a masked token ought to possess. Among these properties, we principally focus on articulating and emphasizing the "data singularity" attribute inherent in masked tokens. Through a comprehensive analysis of the heterogeneity between masked tokens and visible tokens within pre-trained models, we propose a novel approach termed masked token optimization (MTO), specifically designed to improve model efficiency through weight recalibration and the enhancement of the key property of masked tokens. The proposed method is an adaptable solution that integrates seamlessly into any MIM approach that leverages masked tokens. As a result, MTO considerably improves pre-training efficiency, reducing by approximately 50% the number of pre-training epochs required to attain the converged performance of recent approaches.

Paper Structure (17 sections, 8 equations, 4 figures, 3 tables)

Introduction
Preliminaries
Analysis
Heterogeneity Measure via Entropy
Heterogeneity Analysis
Heterogeneity analysis on pre-trained models of different methods
Heterogeneity analysis of converged and non-converged models
Masked Token Optimization
Experiments
Metric for Efficient Pre-training
Baseline Models
Performance Comparisons
Ablation on Objectives
Related Work
Masked Language Modeling
...and 2 more sections

Figures (4)

Figure 1: To investigate the heterogeneity between masked tokens and visible tokens, we analyze the pre-trained models of recent approaches [xie2022simmim, he2022masked]. (a) shows that the heterogeneity between the two types of tokens is highest at the initial embedding for both approaches and gradually decreases in subsequent layers. Unlike the pre-trained model, the non-converged model shown in (b) displays an erratic heterogeneity trend, indicating that this tendency is acquired through model convergence.
Figure 2: We present the affinity map between every token pair for each layer of the pre-trained model [xie2022simmim]. Affinity maps are listed in order from the initial embedding to subsequent layers, and the x-axis and y-axis of each affinity map are both arranged in the order of masked tokens $X_M$ followed by visible image tokens $X_V$. Min-max normalization was used for the visualization of the affinity maps.
Figure 3: The proposed Masked Token Optimization (MTO) approach selectively excludes semantically inconsequential masked tokens from the weight aggregation process for visible tokens via Eq. (eq:spa_loss). At the same time, it enforces the data singularity constraints of Eqs. (eq:entropy_maximization) and (eq:rank_loss) according to the depth of the layer, enhancing the model's capability to accurately identify regions that require semantic restoration.
Figure 4: The comprehensive performance results of applying MTO to various baselines [xie2022simmim, he2022masked, dong2022bootstrapped, yi2022masked]. MTO substantially improves pre-training efficiency, attaining the standard performance within approximately 400 epochs for every baseline method. This indicates that a marked gain in efficiency is achievable for any MIM method through the application of MTO, making it a viable option for methods that use masked tokens.
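
The affinity map described in the Figure 2 caption can be illustrated with a minimal sketch. The caption states only that affinities are computed between every token pair, that both axes are ordered with masked tokens before visible tokens, and that min-max normalization is applied for visualization. The use of cosine similarity as the affinity, as well as the names `cosine`, `affinity_map`, `tokens`, and `is_masked`, are assumptions made here for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (assumed affinity; not from the paper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def affinity_map(tokens, is_masked):
    """Return (normalized_map, order) for one layer's token embeddings.

    tokens    : list of equal-length float vectors, one per token (hypothetical input)
    is_masked : list of bools, True where the token is a masked token
    """
    # Arrange both axes as masked tokens first, then visible tokens.
    order = [i for i, m in enumerate(is_masked) if m]
    order += [i for i, m in enumerate(is_masked) if not m]
    ordered = [tokens[i] for i in order]

    # Pairwise affinity between every token pair.
    raw = [[cosine(u, v) for v in ordered] for u in ordered]

    # Min-max normalization over the whole map, for visualization only.
    lo = min(min(row) for row in raw)
    hi = max(max(row) for row in raw)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in raw], order
    return [[(x - lo) / span for x in row] for row in raw], order


# Toy usage: three 2-d tokens, the second of which is masked.
amap, order = affinity_map([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
                           [False, True, False])
```

In this toy case the masked token is orthogonal to both visible tokens, so the masked-to-visible entries of the normalized map sit at the minimum while the visible-to-visible entries sit at the maximum.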

Theorems & Definitions (1)

proof
