CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

Chu Myaet Thwal, Ye Lin Tun, Minh N. H. Nguyen, Eui-Nam Huh, Choong Seon Hong

TL;DR

CLIP-PING introduces a resource-efficient training paradigm for lightweight vision-language models by leveraging Proximus Intrinsic Neighbors Guidance from frozen unimodal encoders. It combines intra-modal NN and inter-modal XNN supervision derived from auxiliary feature banks to enrich cross-modal learning without heavy distillation or large compute. The method yields consistent gains in zero-shot classification and cross-modal retrieval across multiple lightweight architectures, with additional benefits when using an active-teacher variant (A-CLIP-PING). The approach demonstrates strong transferability under linear evaluation and offers a practical route to deploy capable vision-language models in data- and compute-constrained settings.

Abstract

Beyond the success of Contrastive Language-Image Pre-training (CLIP), recent trends mark a shift toward exploring the applicability of lightweight vision-language models for resource-constrained scenarios. These models often deliver suboptimal performance when relying solely on a single image-text contrastive learning objective, highlighting the need for more effective training mechanisms that ensure robust cross-modal feature alignment. In this work, we propose CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance, a novel yet simple and efficient training paradigm designed to boost the performance of lightweight vision-language models with minimal computational overhead and lower data demands. CLIP-PING bootstraps unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance from proximus neighbor samples, i.e., nearest-neighbor (NN) and cross nearest-neighbor (XNN). We find that extra contrastive supervision from these neighbors substantially boosts cross-modal alignment, enabling lightweight models to learn more generic features with rich semantic diversity. Extensive experiments reveal that CLIP-PING notably surpasses its peers in zero-shot generalization and cross-modal retrieval tasks. Specifically, compared to the original CLIP, it achieves a 5.5% gain on zero-shot ImageNet1K classification and gains of 10.7% (I2T) and 5.7% (T2I) on Flickr30K retrieval when using a ViT-XS image encoder trained on 3 million (image, text) pairs. Moreover, CLIP-PING shows strong transferability under the linear evaluation protocol across several downstream tasks.

Paper Structure

This paper contains 49 sections, 17 equations, 5 figures, 24 tables.

Figures (5)

  • Figure 1: Comparison of zero-shot classification and retrieval performance using the ViT-XS [dosovitskiy2020image] image encoder, trained on the COCO+CC3M [lin2014microsoft, sharma2018conceptual] dataset with 3M (image, text) pairs.
  • Figure 2: Example of nearest-neighbor (NN) and cross nearest-neighbor (XNN) samples for an (image, text) pair, i.e., $(I_k,T_k)$, from the COCO [lin2014microsoft] dataset.
  • Figure 3: Overview of the CLIP-PING pipeline. Unimodal feature extraction is performed prior to multi-modal training, and the extracted features are stored frozen in auxiliary feature banks. Each feature support set is representative of the corresponding auxiliary feature bank.
  • Figure 4: Example of the nearest-neighbor (NN) and cross nearest-neighbor (XNN) retrieval process, illustrated in right-to-left order.
  • Figure 5: Curves for ViT-XS [dosovitskiy2020image] + MobileBERT$_{\text{TINY}}$ [sun2020mobilebert].
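
The NN and XNN retrieval that Figures 2 and 4 illustrate can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It makes several assumptions that this page does not state: the two feature banks are index-aligned (entry `i` of each bank comes from the same (image, text) pair), similarity is cosine similarity, the query itself is excluded from the search, and the XNN is taken to be the other-modality partner of the retrieved NN. All function and variable names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_index(query, bank, exclude=None):
    """Index of the bank entry most similar to `query`, skipping `exclude`."""
    best_idx, best_sim = -1, -math.inf
    for idx, feat in enumerate(bank):
        if idx == exclude:
            continue
        sim = cosine(query, feat)
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx


def retrieve_neighbors(k, image_bank, text_bank):
    """Neighbor indices for pair k, searched in the frozen unimodal banks.

    Returns (i, t):
      i: NN of image k in the image bank. Under the assumption above,
         text_bank[i] would serve as the XNN for the image side.
      t: NN of text k in the text bank. Likewise, image_bank[t] would
         serve as the XNN for the text side.
    """
    i = nearest_index(image_bank[k], image_bank, exclude=k)
    t = nearest_index(text_bank[k], text_bank, exclude=k)
    return i, t


# Toy banks with three aligned (image, text) pairs and 2-D features.
image_bank = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_bank = [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9]]
print(retrieve_neighbors(0, image_bank, text_bank))
```

In this toy example the image and text neighbors of pair 0 differ, which is the point of searching each modality separately: the two banks can surface different, complementary neighbors for the same pair.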