On the Role of Discrete Tokenization in Visual Representation Learning

Tianqi Du, Yifei Wang, Yisen Wang

TL;DR

The paper addresses how discrete tokenization in masked image modeling (MIM) shapes representation learning and downstream generalization. It develops a graph-based theory showing that tokenization induces equivalence classes in the target space and reshapes the augmentation graph connectivity, influencing generalization bounds. It introduces Token-Class Alignment Similarity (TCAS) as a training-free tokenizer-quality metric and ClusterMIM as a clustering-based tokenizer that yields strong empirical gains on ImageNet-100/1K across ViT backbones. The work argues that tokenizers aligned with true data classes improve intra-class connectivity and reduce inter-class confusion, providing a practical pathway to improved MIM-based representations without supervision.

Abstract

In the realm of self-supervised learning (SSL), masked image modeling (MIM) has gained popularity alongside contrastive learning methods. MIM involves reconstructing masked regions of input images using their unmasked portions. A notable subset of MIM methodologies employs discrete tokens as the reconstruction target, but the theoretical underpinnings of this choice remain underexplored. In this paper, we explore the role of these discrete tokens, aiming to unravel their benefits and limitations. Building upon the connection between MIM and contrastive learning, we provide a comprehensive theoretical understanding of how discrete tokenization affects the model's generalization capabilities. Furthermore, we propose a novel metric named TCAS, which is specifically designed to assess the effectiveness of discrete tokens within the MIM framework. Inspired by this metric, we contribute an innovative tokenizer design and propose a corresponding MIM method named ClusterMIM. It demonstrates superior performance on a variety of benchmark datasets and ViT backbones. Code is available at https://github.com/PKU-ML/ClusterMIM.

Paper Structure

This paper contains 16 sections, 1 theorem, 15 equations, 6 figures, 5 tables.

Key Result

Theorem 1

Assume that $\mathcal{M}(x_1|x_2)>0$ only if $y(x_1)=y(x_2)$, and let $\sim_y$ denote the equivalence relation on $\mathcal{X}_2$ where $x_2\sim_y x_2^+$ if and only if $y(x_2)=y(x_2^+)$. Then $\mathcal{S}^y=\mathcal{X}_2/\sim_y=\{\mathcal{S}^y_1,\dots,\mathcal{S}^y_c\}$ minimizes $c_1\sum_
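The statement is cut off in this listing, so the objective being minimized is not reproduced here. The construction of the quotient set $\mathcal{X}_2/\sim_y$ can still be illustrated on its own. The following is a minimal sketch with hypothetical toy data: the function name `quotient_by_label`, the sample strings, and the labeling rule are all assumptions made for illustration and do not come from the paper.

```python
from collections import defaultdict


def quotient_by_label(samples, label_fn):
    """Partition samples into equivalence classes.

    Two samples a, b are equivalent iff label_fn(a) == label_fn(b),
    mirroring x_2 ~_y x_2^+  <=>  y(x_2) = y(x_2^+).
    """
    classes = defaultdict(list)
    for s in samples:
        classes[label_fn(s)].append(s)
    # Return the classes in a deterministic order (sorted by label).
    return [classes[label] for label in sorted(classes)]


# Hypothetical toy targets x_2; the first character plays the role of y(x_2).
toy_targets = ["a0", "a1", "b0", "b1", "b2", "c0"]


def toy_label(x):
    return x[0]


partition = quotient_by_label(toy_targets, toy_label)
print(partition)  # [['a0', 'a1'], ['b0', 'b1', 'b2'], ['c0']]
```

Each inner list corresponds to one class $\mathcal{S}^y_i$, and the number of lists equals $c$, the number of distinct labels (3 in this toy case).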

Figures (6)

  • Figure 1: An illustration of how the discrete tokenization affects the mask graph and the corresponding augmentation graph. $x_2$ and $x'_2$ share the same discrete token, enabling a connection between $x_1$ and $x'_1$ through $x_2$ and $x'_2$, whereas such a connection is not possible in MAE.
  • Figure 2: Visual illustration of the three tokenization approaches in the toy model. Each orange bounding box represents an equivalence class, whose elements share the same discrete token. Class-wise tokenization exhibits a higher intra-class connectivity and lower inter-class connectivity compared to MAE-like tokenization. Consequently, it boasts a lower downstream error bound. In contrast, cross-class tokenization leads to lower intra-class connectivity and higher inter-class connectivity, resulting in a significantly larger downstream error bound.
  • Figure 3: Correlation between TCAS and linear probing accuracy.
  • Figure 4: Ablation study on the selection of clustering number $K$. Linear probing accuracy / Fine-tuning accuracy in each box.
  • Figure 5: Experiments exploring different K-Means training epochs. Linear probing accuracy / Fine-tuning accuracy in each box.
  • ...and 1 more figure
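Figures 4 and 5 refer to a clustering number $K$ and to K-Means training, which suggests how a clustering-based tokenizer could be built. The sketch below is one possible reading, not the authors' implementation: it assumes raw pixel patches are clustered with plain Lloyd iterations, and every name in it (`extract_patches`, `fit_kmeans`, `assign_tokens`, the image and patch sizes) is hypothetical. The actual ClusterMIM code is in the repository linked in the abstract.

```python
import numpy as np


def extract_patches(images, patch):
    """Split (N, H, W, C) images into flattened, non-overlapping patches.

    Returns an array of shape (N, num_patches, patch * patch * C).
    """
    n, h, w, c = images.shape
    gh, gw = h // patch, w // patch
    x = images[:, :gh * patch, :gw * patch, :]
    x = x.reshape(n, gh, patch, gw, patch, c).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(n, gh * gw, patch * patch * c)


def assign_tokens(points, centers):
    """Map each point (M, D) to the index of its nearest center (K, D)."""
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)


def fit_kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd iterations; returns cluster centers of shape (k, D)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iters):
        ids = assign_tokens(points, centers)
        for j in range(k):
            members = points[ids == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


# Hypothetical data: 8 random 32x32 RGB "images", 16x16 patches, K = 5.
rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3))
K = 5

patches = extract_patches(images, patch=16)      # (8, 4, 768)
flat = patches.reshape(-1, patches.shape[-1])    # (32, 768)
centers = fit_kmeans(flat, k=K)
tokens = assign_tokens(flat, centers).reshape(patches.shape[:2])
print(tokens.shape)  # (8, 4): one discrete token id per patch
```

In a MIM setup along these lines, the token ids of the masked patches would serve as classification targets. Under this reading, a larger $K$ gives finer equivalence classes and a smaller $K$ gives coarser ones.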

Theorems & Definitions (1)

  • Theorem 1