Morphing Tokens Draw Strong Masked Image Models

Taekyung Kim, Byeongho Heo, Dongyoon Han

TL;DR

This work identifies spatial inconsistency in token-level supervision for masked image modeling and demonstrates that naive token aggregation fails to address the underlying noise. It introduces Dynamic Token Morphing (DTM), a context-preserving, dynamically scheduled token aggregation method that aligns morphed online and target tokens via a morphing matrix and multiple DTM losses. Empirically, DTM yields state-of-the-art results on ImageNet-1K and competitive performance on ADE20K, with improved transferability to diverse datasets and robust applicability across SSL frameworks. The findings suggest that maintaining context while morphing tokens can substantially enhance discriminability and training efficiency in vision transformers.

Abstract

Masked image modeling (MIM) has emerged as a promising approach for pre-training Vision Transformers (ViTs). MIMs predict masked tokens token-wise to recover target signals that are tokenized from images or generated by pre-trained models like vision-language models. While using tokenizers or pre-trained models is viable, they often offer spatially inconsistent supervision even for neighboring tokens, hindering models from learning discriminative representations. Our pilot study identifies spatial inconsistency in supervisory signals and suggests that addressing it can improve representation learning. Building upon this insight, we introduce Dynamic Token Morphing (DTM), a novel method that dynamically aggregates tokens while preserving context to generate contextualized targets, thereby likely reducing spatial inconsistency. DTM is compatible with various SSL frameworks; we showcase significantly improved MIM results, barely introducing extra training costs. Our method facilitates MIM training by using more spatially consistent targets, resulting in improved training trends as evidenced by lower losses. Experiments on ImageNet-1K and ADE20K demonstrate DTM's superiority, which surpasses complex state-of-the-art MIM methods. Furthermore, the evaluation of transfer learning on downstream tasks like iNaturalist, along with extensive empirical studies, supports DTM's effectiveness.

Paper Structure (26 sections, 5 equations, 8 figures, 16 tables, 1 algorithm)

This paper contains 26 sections, 5 equations, 8 figures, 16 tables, 1 algorithm.

Figures (8)

  • Figure 1: What is spatial consistency among visual tokens? We schematically visualize token-wise zero-shot classification results to illustrate spatially inconsistent token predictions. Given the input image (a), results (b) and (c) display the predicted class for each token within four example bounding boxes without and with token aggregation, respectively. The lightness of red depicts the difference between the predicted and ground-truth classes, whereas green marks a correct prediction. Out of a total of 196 tokens, 113 are predicted correctly with aggregation and 82 without; aggregation gives fewer spatially inconsistent representations. The zero-shot accuracies (reported in Table \ref{tab:pt_models_linprob}) support the connection between spatial consistency and the model's ability.
  • Figure 2: Representation learning with various supervisions. We illustrate our study's base representation learning framework along with different supervisory signal functions $f$. We evaluate five forms of the distillation target: 1) token-wise supervision (baseline); 2) downsampled supervision; 3) supervision after layer-wise bipartite matching; 4) superpixel supervision; 5) supervision by token morphing.
  • Figure 3: Token morphing offers diverse contextualized signals. Dynamic Token Morphing (DTM) aligns token representations by dynamically aggregating contextually related tokens to create diverse, contextualized targets. Blue and green tokens denote the representations of the image patches processed by the online and target models, respectively. Gray tokens denote masked tokens.
  • Figure 4: Overview of Masked Image Modeling via Dynamic Token Morphing (DTM). Following DTM's token morphing schedule, we aggregate a dynamic range of tokens using the morphing matrix $M$ derived from the target tokens $\{\textbf{v}_i\}_{i=1}^N$. Specifically, we randomly sample a number of remaining tokens $\bar{n}$ and an iteration number $k$ to dynamically schedule token morphing (i.e., $\{r_p\}_{p=1}^{k}$), forming $\bar{n}$ morphed tokens $\{\hat{\textbf{u}}_i\}_{i=1}^{\bar{n}}$ and $\{\hat{\textbf{v}}_i\}_{i=1}^{\bar{n}}$. Then, we align the representations of the corresponding online and target morphed tokens.
  • Figure 5: Illustrative description of the morphing matrix $M=\prod_{p=1}^k \bar{M}^p$. In the illustration of the morphing matrix $M$, green and white entries denote $M_{ij}=1$ and $M_{ij}=0$, respectively, where the $(i,j)$-th entry indicates whether the $j$-th token representation $\textbf{v}_j$ is aggregated into the $i$-th morphed token representation $\hat{\textbf{v}}_i$. Multiplying the morphing matrix $M$ by the token representations $\{\textbf{v}_j\}^N_{j=1}$ and then normalizing by the number of aggregated tokens $\sum_j M_{ij}$ yields the morphed token representations $\{\hat{\textbf{v}}_i\}_{i=1}^{\bar{n}}$, as formulated in Eq. \ref{eq:morphing}. If we distribute the morphed tokens back to the tokens they aggregate and arrange them in their original positions, we obtain smoothed image representations. Note that a morphing matrix is generated for each morphing case, as shown in Fig. \ref{fig:representative}.
  • ...and 3 more figures
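
The operation described in the Figure 5 caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `morph_tokens` and `distribute_back` are hypothetical, and the morphing matrix $M$ is written by hand here, whereas the paper derives it from the target tokens as a product of per-iteration matrices. The sketch only shows the averaging step $\hat{\textbf{v}}_i = \sum_j M_{ij}\textbf{v}_j / \sum_j M_{ij}$ and the redistribution of morphed tokens back to their source positions.

```python
def morph_tokens(M, V):
    """Average the token vectors in V that each row of the binary matrix M selects.

    M: n_bar x N list of 0/1 entries (hand-written here; an assumption).
    V: N x d list of token representations.
    """
    dim = len(V[0])
    morphed = []
    for row in M:
        count = sum(row)  # number of tokens aggregated into this morphed token
        if count == 0:
            raise ValueError("each morphed token must aggregate at least one token")
        summed = [sum(m * v[d] for m, v in zip(row, V)) for d in range(dim)]
        morphed.append([s / count for s in summed])
    return morphed


def distribute_back(M, morphed):
    """Assign each original token j the morphed representation of its group."""
    n_tokens = len(M[0])
    out = [None] * n_tokens
    for i, row in enumerate(M):
        for j, m in enumerate(row):
            if m == 1:
                out[j] = list(morphed[i])
    return out


# Toy example: N = 4 tokens of dimension 2, morphed into n_bar = 2 groups.
V = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
M = [[1, 1, 0, 0],
     [0, 0, 1, 1]]

V_hat = morph_tokens(M, V)              # one averaged vector per group
V_smooth = distribute_back(M, V_hat)    # smoothed representation per original token
```

According to the Figure 4 caption, the same matrix $M$ is applied to both the online tokens and the target tokens, so calling `morph_tokens` with the same `M` on each set would give the corresponding pairs of morphed tokens to align.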