Centroid-centered Modeling for Efficient Vision Transformer Pre-training

Xin Yan, Zuchao Li, Lefei Zhang

TL;DR

CCViT introduces a centroid-centered, non-parametric tokenizer for Masked Image Modeling by applying k-means to image patches to produce centroids that function as both patch pixels and token IDs. The framework uses blockwise masking plus centroid replacement and a two-branch ViT backbone to learn both token predictions and pixel reconstructions, optimizing a joint loss L_CIM = L_CE + L_MSE. Empirically, CCViT achieves 84.3% top-1 on ImageNet-1K with ViT-B and 86.0% with ViT-L, and 48.4 mIoU on ADE20K (ViT-B), demonstrating competitive performance without external data; the centroid tokenizer can be constructed in seconds and requires far fewer resources than parametric tokenizers. Ablation studies show that learning both token and pixel targets and using random replacement improve results, and analyses indicate that the centroid-based tokenizer is more robust and more efficient than the BEiT and BEiTv2 tokenizers.
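
The centroid tokenizer described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (`extract_patches`, `assign`, `fit_centroids`), the plain Lloyd-style k-means loop, and the initialization from randomly chosen patches are all assumptions made for the example. It shows the double role of a centroid: `assign` returns the token ID, and indexing the centroid table with that ID returns the pixel values that stand in for the patch.

```python
import numpy as np


def extract_patches(images, patch):
    """Split (N, H, W, C) images into flattened, non-overlapping patches.

    Returns an array of shape (N, num_patches, patch * patch * C).
    """
    n, h, w, c = images.shape
    gh, gw = h // patch, w // patch
    x = images[:, :gh * patch, :gw * patch, :]
    x = x.reshape(n, gh, patch, gw, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)  # group the two grid axes together
    return x.reshape(n, gh * gw, patch * patch * c)


def assign(patches, centroids):
    """Token IDs: index of the nearest centroid for each row of `patches`."""
    # Squared Euclidean distance expanded as ||a||^2 - 2 a.b + ||b||^2.
    d = ((patches ** 2).sum(axis=1, keepdims=True)
         - 2.0 * patches @ centroids.T
         + (centroids ** 2).sum(axis=1))
    return d.argmin(axis=1)


def fit_centroids(patches, k, iters=10, seed=0):
    """Plain k-means (Lloyd iterations) on flattened patches of shape (M, D)."""
    rng = np.random.default_rng(seed)
    # Hypothetical initialization: k distinct patches picked at random.
    start = rng.choice(len(patches), size=k, replace=False)
    centroids = patches[start].astype(np.float64)
    for _ in range(iters):
        ids = assign(patches, centroids)
        for j in range(k):
            members = patches[ids == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return centroids
```

Usage, under the same assumptions: `flat = extract_patches(images, 16).reshape(-1, 16 * 16 * 3)`, then `centroids = fit_centroids(flat, k)` and `ids = assign(flat, centroids)`; `centroids[ids]` is the pixel-space approximation of each patch. No parameters are trained, which is why such a tokenizer is cheap to build.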

Abstract

Masked Image Modeling (MIM) is a new self-supervised vision pre-training paradigm using a Vision Transformer (ViT). Previous works are either pixel-based or token-based, using original pixels or discrete visual tokens from parametric tokenizer models, respectively. Our proposed centroid-based approach, CCViT, leverages k-means clustering to obtain centroids for image modeling without supervised training of a tokenizer model. This non-parametric centroid tokenizer takes only seconds to create and is faster for token inference. The centroids can represent both patch pixels and index tokens, and they have the property of local invariance. Specifically, we adopt patch masking and centroid replacing strategies to construct corrupted inputs, and two stacked encoder blocks to predict corrupted patch tokens and reconstruct original patch pixels. Experiments show that our CCViT achieves 84.4% top-1 accuracy on ImageNet-1K classification with ViT-B and 86.0% with ViT-L. We also transfer our pre-trained model to other downstream tasks. Our approach achieves results competitive with recent baselines without external supervision or distillation training from other models.
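
The two pre-training targets (centroid index tokens and original pixels) combine into the joint objective L_CIM = L_CE + L_MSE stated in the TL;DR. The sketch below is one possible reading in NumPy, not the paper's code: the function name `cim_loss`, the tensor shapes, and the choice to average both terms over the corrupted positions only are assumptions, since the summary does not say which positions each term covers.

```python
import numpy as np


def cim_loss(logits, target_ids, pred_pixels, target_pixels, corrupted):
    """Joint loss L_CIM = L_CE + L_MSE (equal weights, as in the TL;DR).

    logits:        (B, N, K) scores over the K centroid indices
    target_ids:    (B, N)    integer centroid index of each original patch
    pred_pixels:   (B, N, D) reconstructed patch pixels
    target_pixels: (B, N, D) original patch pixels
    corrupted:     (B, N)    bool, True where the input patch was corrupted
    Assumption: both terms are averaged over corrupted positions only.
    """
    # Numerically stable log-softmax over the centroid vocabulary.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, target_ids[..., None], axis=-1)
    ce_per_patch = -picked[..., 0]
    mse_per_patch = ((pred_pixels - target_pixels) ** 2).mean(axis=-1)

    w = corrupted.astype(np.float64)
    denom = max(w.sum(), 1.0)  # guard against an empty selection
    l_ce = (ce_per_patch * w).sum() / denom
    l_mse = (mse_per_patch * w).sum() / denom
    return l_ce + l_mse, l_ce, l_mse
```

With uniform logits over K centroids and a perfect pixel reconstruction, the total reduces to log K, which is a convenient sanity check for an implementation of this form.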

Paper Structure (25 sections, 5 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: The proposed CCViT architecture. We view centroids from two aspects: token IDs and patch pixels. Our centroid-centered pre-training aims at both predicting centroid indices and reconstructing image patch pixels. We apply a blockwise mask to some patches (e.g., 40%) and replace some of the remaining patches (e.g., 10%) with their corresponding centroids. All corrupted patches are fed into the ViT block.
  • Figure 2: Overview of our CCViT. Before pre-training, we apply k-means clustering to raw pixel patches to obtain centroids. During pre-training, we mask out some patches and randomly replace some of the remaining patches with centroids. All patches are fed into the encoder. The pre-training targets are both centroid index tokens and original pixels.
  • Figure 3: Comparison of pre-training architectures between BEiT, MAE, and ours.
  • Figure 4: Visualization of different image noises in Table \ref{table: cmp_tokenizer} and Table \ref{table: cmp_2}.
  • Figure 5: Reconstruction examples from the ImageNet-1K validation set via BEiT, MAE, and our CCViT.
  • ...and 2 more figures
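
The corruption step described in the captions of Figures 1 and 2 (mask some patches, then replace some of the remaining ones with their centroids) can be sketched as follows. Several simplifications are assumptions of this example rather than details from the paper: masking is uniformly random here instead of blockwise, the 10% replacement ratio is taken as a fraction of the unmasked patches, and the function name `corrupt` and its signature are hypothetical. Masked patches are only flagged; substituting a mask token is left to the encoder.

```python
import numpy as np


def corrupt(patches, token_ids, centroids,
            mask_ratio=0.4, replace_ratio=0.1, seed=0):
    """Build a corrupted input for one image.

    patches:   (N, D) flattened patches of the image
    token_ids: (N,)   nearest-centroid index of each patch
    centroids: (K, D) centroid table from the k-means tokenizer
    Returns (corrupted_patches, masked, replaced) with boolean flags of shape (N,).
    Assumption: random masking stands in for the blockwise masking in the paper.
    """
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    perm = rng.permutation(n)
    n_mask = int(round(mask_ratio * n))
    # Assumption: the replacement ratio applies to the unmasked patches.
    n_rep = int(round(replace_ratio * (n - n_mask)))

    masked = np.zeros(n, dtype=bool)
    replaced = np.zeros(n, dtype=bool)
    masked[perm[:n_mask]] = True
    replaced[perm[n_mask:n_mask + n_rep]] = True

    out = patches.copy()
    # A replaced patch is overwritten by the pixels of its own centroid.
    out[replaced] = centroids[token_ids[replaced]]
    return out, masked, replaced
```

For a 14 x 14 grid (196 patches) with the default ratios, this selects 78 masked and 12 replaced patches, and the two sets never overlap because they are drawn from disjoint slices of one permutation.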