Enhancing Vision-Language Model with Unmasked Token Alignment
Jihao Liu, Jinliang Zheng, Boxiao Liu, Yu Liu, Hongsheng Li
TL;DR
The paper addresses the high computational cost of vision-language pre-training by enhancing CLIP representations through Unmasked Token Alignment (UTA). UTA trains a Vision Transformer (ViT) from scratch by aligning its unmasked visual tokens to the corresponding tokens from a frozen CLIP vision encoder. It uses a dense token-wise objective and a reversed masking strategy to improve spatial coverage, which avoids [MASK] tokens and enables zero-shot evaluation with the CLIP text encoder. Empirically, UTA yields strong zero-shot performance and competitive multi-modal and uni-modal results, outperforming several Masked Image Modeling baselines and showing notable gains on benchmarks such as ImageNet, COCO/LVIS, and LLaVA-Bench, while reducing training FLOPs. The work demonstrates that token-level alignment to a frozen CLIP teacher is an efficient and effective route to cross-modal representations with practical zero-shot capabilities and robust downstream transfer.
Abstract
Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multi-modal visual-language representations. Although CLIP has demonstrated remarkable performance, training it from scratch on noisy web-scale datasets is computationally demanding. On the other hand, mask-then-predict pre-training approaches, like Masked Image Modeling (MIM), offer efficient self-supervised learning for single-modal representations. This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations. UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder. The pre-trained ViT can be applied directly to zero-shot evaluation even without training on image-text pairs. Compared to MIM approaches, UTA does not suffer from training-finetuning inconsistency and is much more training-efficient because it avoids the extra [MASK] tokens. Extensive experimental results demonstrate that UTA can enhance CLIP models and outperform existing MIM methods on various uni- and multi-modal benchmarks. Code and models are available at https://github.com/jihaonew/UTA.
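
To make the alignment step concrete, the following is a minimal sketch of a token-wise alignment loss, not the authors' implementation. It assumes that the frozen teacher produces one embedding per image token for the full image, that the student ViT only processes (and outputs embeddings for) the kept, unmasked positions, and that the per-token objective is a cosine distance. The function names, the uniform random masking, and the choice of cosine distance are all illustrative assumptions; the reversed masking strategy is not modeled here.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def sample_unmasked_indices(num_tokens, mask_ratio, rng):
    """Pick the token positions the student keeps.

    Assumption: uniform random masking, used here only for illustration.
    """
    num_keep = max(1, round(num_tokens * (1.0 - mask_ratio)))
    return sorted(rng.sample(range(num_tokens), num_keep))


def token_alignment_loss(student_tokens, teacher_tokens, keep):
    """Mean cosine distance between student tokens and matching teacher tokens.

    student_tokens: one embedding per kept position (no [MASK] tokens are fed in).
    teacher_tokens: one embedding per position of the full image (frozen teacher).
    keep:           kept positions, in the same order as student_tokens.
    """
    if len(student_tokens) != len(keep):
        raise ValueError("student must output exactly one token per kept position")
    total = 0.0
    for student_vec, idx in zip(student_tokens, keep):
        total += 1.0 - cosine_similarity(student_vec, teacher_tokens[idx])
    return total / len(keep)


# Toy example with 4 teacher tokens of dimension 2.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
keep = [0, 2]
aligned_student = [[2.0, 0.0], [3.0, 3.0]]     # same directions as teacher[0], teacher[2]
misaligned_student = [[0.0, 1.0], [1.0, 1.0]]  # first token orthogonal to teacher[0]
loss_aligned = token_alignment_loss(aligned_student, teacher, keep)
loss_misaligned = token_alignment_loss(misaligned_student, teacher, keep)
```

Because the loss is computed only at the kept positions, the student never needs placeholder [MASK] inputs in this sketch, which mirrors the efficiency argument made in the abstract. In a real training setup the embeddings would come from batched tensor operations rather than Python lists.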
