Multi-Token Enhancing for Vision Representation Learning

Zhong-Yu Li, Yu-Song Hu, Bo-Wen Yin, Ming-Ming Cheng

TL;DR

Multi-Token Enhancing (MTE) extracts multiple auxiliary tokens simultaneously from a single model to enhance representation learning, incurring minimal additional training cost and no additional inference cost.

Abstract

Vision representation learning, especially self-supervised learning, is pivotal for various vision applications. Ensemble learning has also succeeded in enhancing the performance and robustness of vision models. However, traditional ensemble strategies are impractical for representation learning, especially self-supervised representation learning, which requires large-scale datasets and long training schedules: an ensemble of k models requires k times the training and inference computation. In contrast, we introduce Multi-Token Enhancing (MTE), which extracts multiple auxiliary tokens simultaneously from a single model to enhance representation learning, while incurring minimal additional training cost and no additional inference cost. These auxiliary tokens, including auxiliary CLS tokens and adaptively pooled tokens, capture complementary information due to their differences. To avoid the inference overhead that the auxiliary tokens would otherwise add, we distill the knowledge they acquire into a global token during pre-training. Consequently, we can discard the auxiliary tokens during inference without incurring additional cost. MTE is compatible with various self-supervised loss functions and architectures, and consistently improves performance across different downstream tasks. Our source code will be made publicly available.
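
The abstract does not spell out the distillation objective, so the following is only a minimal sketch of one common way such a step could look: the class distributions predicted from several auxiliary tokens are averaged into an ensemble target, and the global token is trained to match it with a cross-entropy loss. The function names (`softmax`, `ensemble_target`, `distillation_loss`), the averaging rule, and the temperature parameter are all assumptions made for illustration, not the authors' formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_target(aux_logits, temperature=1.0):
    """Average the distributions predicted from each auxiliary token.

    Assumption: a plain mean is used as the ensemble rule.
    """
    probs = [softmax(logits, temperature) for logits in aux_logits]
    count = len(probs)
    return [sum(p[i] for p in probs) / count for i in range(len(probs[0]))]


def distillation_loss(global_logits, aux_logits, temperature=1.0):
    """Cross-entropy between the ensemble target and the global token.

    Assumption: the target is treated as a constant (no gradient flows
    into the auxiliary tokens through this term).
    """
    target = ensemble_target(aux_logits, temperature)
    student = softmax(global_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(target, student) if t > 0)


# Two hypothetical auxiliary tokens that mostly agree on class 0.
aux = [[2.0, 0.0, 0.0], [1.5, 0.5, 0.0]]
loss_agree = distillation_loss([2.0, 0.0, 0.0], aux)
loss_disagree = distillation_loss([0.0, 2.0, 0.0], aux)
```

In this toy setting `loss_agree` is smaller than `loss_disagree`, which is the behaviour a distillation term needs: the global token is rewarded for matching the ensemble. Because only the global token is needed once it has absorbed this signal, the auxiliary tokens can be dropped at inference.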

Paper Structure

This paper contains 23 sections, 9 equations, 15 figures, 16 tables.

Figures (15)

  • Figure 1: Increasing the number of the proposed auxiliary tokens leads to greater improvements without incurring additional inference costs. We combine varying numbers of auxiliary tokens for clustering (using the prototype layer proposed in caron2021emerging) and assess performance using the normalized mutual information (NMI) between the generated pseudo labels and the true labels. As the number of auxiliary tokens increases, they complement each other, resulting in better performance. During inference, the auxiliary tokens are removed, so they add no inference cost.
  • Figure 2: $k$-NN Top-1 accuracies when combining MTE with different methods, including MoBY xie2021moby, DINO caron2021emerging, and iBOT zhou2021ibot.
  • Figure 3: For effective training, our MTE employs additional auxiliary parts, which will be discarded during inference and fine-tuning.
  • Figure 4: The attention mask in the self-attention layers. $z_c$, $z_a$, and $z_p$ represent the CLS token, auxiliary CLS tokens, and patch tokens, respectively. White circles mean that the corresponding query is not allowed to attend to the corresponding key.
  • Figure 5: Visualization of the kernel weights in the large-kernel convolutions used to generate the adaptive pooling weights. Each black box corresponds to a randomly chosen channel and contains the $11\times 11$ kernel weights of six adaptively pooled tokens.
  • ...and 10 more figures
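
Figure 1 scores clustering quality with the normalized mutual information (NMI) between pseudo labels and true labels. As a small worked example of that metric, the sketch below computes NMI from two label lists using only the standard library. The caption does not say which normalization is used, so normalizing by the arithmetic mean of the two entropies is an assumption here, and the helper names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two label assignments."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(pseudo_labels, true_labels):
    """NMI normalized by the mean of the two entropies (an assumption)."""
    if len(pseudo_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    h_pseudo, h_true = entropy(pseudo_labels), entropy(true_labels)
    if h_pseudo == 0 and h_true == 0:
        return 1.0  # both assignments are a single cluster
    return mutual_information(pseudo_labels, true_labels) / ((h_pseudo + h_true) / 2)


# Cluster ids are arbitrary, so a relabelled but identical partition scores 1.
same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that is statistically independent of the labels scores 0.
independent = nmi([0, 1, 0, 1], [0, 0, 1, 1])
```

NMI is invariant to how cluster ids are named, which is why it suits pseudo labels produced by clustering: only the grouping matters, not the label values.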