Enhancing Vision-Language Model with Unmasked Token Alignment

Jihao Liu, Jinliang Zheng, Boxiao Liu, Yu Liu, Hongsheng Li

TL;DR

The paper addresses the high cost of vision-language pre-training by improving CLIP representations with Unmasked Token Alignment (UTA). UTA trains a Vision Transformer (ViT) from scratch by aligning its unmasked visual tokens with the corresponding tokens from a frozen CLIP vision encoder. It uses a dense token-wise objective and a reversed masking strategy to improve spatial coverage, avoids [MASK] tokens, and supports zero-shot evaluation with the CLIP text encoder. Empirically, UTA achieves strong zero-shot performance and competitive multi-modal and uni-modal results: it outperforms several Masked Image Modeling baselines and shows gains on benchmarks such as ImageNet, COCO/LVIS, and LLaVA-Bench, while reducing training FLOPs. The work shows that token-level alignment with a frozen CLIP teacher is an efficient route to cross-modal representations with zero-shot capability and robust downstream transfer.

Abstract

Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be directly applied for zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient because it avoids the extra [MASK] tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks. Code and models are available at https://github.com/jihaonew/UTA.
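
The alignment step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `unmasked_token_alignment_loss`, the tensor shapes, and the choice of negative cosine similarity as the token-wise objective are all assumptions made for illustration. The student ViT is assumed to see only the kept (unmasked) patches, while the frozen CLIP vision encoder is assumed to produce one token for every patch.

```python
import numpy as np


def unmasked_token_alignment_loss(student_tokens, teacher_tokens, keep_idx, eps=1e-8):
    """Token-wise alignment loss over unmasked positions (illustrative sketch).

    student_tokens: (B, K, D) outputs of the trainable ViT on the K kept patches.
    teacher_tokens: (B, N, D) outputs of the frozen CLIP vision encoder on all N patches.
    keep_idx:       (B, K) integer indices of the kept (unmasked) patches.

    Returns the mean negative cosine similarity (an assumed objective).
    """
    # Select the teacher tokens at the same positions the student processed.
    target = np.take_along_axis(teacher_tokens, keep_idx[..., None], axis=1)  # (B, K, D)

    # L2-normalize both sides so the dot product is a cosine similarity.
    s = student_tokens / (np.linalg.norm(student_tokens, axis=-1, keepdims=True) + eps)
    t = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)

    cos = np.sum(s * t, axis=-1)  # (B, K), one similarity per unmasked token
    return float(-cos.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    B, N, D, K = 2, 16, 8, 6  # hypothetical batch size, patches, width, kept patches
    teacher = rng.normal(size=(B, N, D))
    keep_idx = np.stack([rng.permutation(N)[:K] for _ in range(B)])
    student = rng.normal(size=(B, K, D))
    print(unmasked_token_alignment_loss(student, teacher, keep_idx))
```

Because the student in this sketch never receives [MASK] placeholders, its input has the same form at pre-training and fine-tuning time, which is the consistency argument the abstract makes against MIM.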

Paper Structure (22 sections, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Overview of Unmasked Token Alignment (UTA). During UTA pre-training, only the unmasked tokens are fed into the vision encoder and aligned with the corresponding tokens from the CLIP vision encoder. After pre-training, the vision encoder is automatically aligned with the CLIP text encoder and can be applied directly to zero-shot evaluation, even without contrastive training on image-text pairs. It can also be further fine-tuned for uni-modal or multi-modal downstream tasks.
  • Figure 2: Qualitative examples generated by LLaVA models fine-tuned with EVA-02 and UTA.
  • Figure 3: Masking probabilities of different locations. The probabilities are calculated by averaging over 5000 random samples.
  • Figure 4: Training and validation loss over the pre-training process.