Simplifying CLIP: Unleashing the Power of Large-Scale Models on Consumer-level Computers
Hongbo Liu
TL;DR
This paper tackles the barrier of training CLIP-like models on consumer hardware by introducing SiCLIP, a lightweight framework that uses SAS-P blocks with weight sharing to reduce the parameter count and accelerate inference. It further improves data efficiency and convergence on small datasets through Weight Inheritance with multi-stage Knowledge Distillation (WIKD) and a Pair Matching (PM) loss, and it augments CC12M with synthetic captions to form CC12M-SYN. Empirical results show that SiCLIP achieves a favorable data-scale-parameter-accuracy trade-off, delivering competitive zero-shot retrieval and classification while training on a single RTX 3090 with 1 TB of storage and offering practical CPU inference speedups. Overall, the work demonstrates that CLIP-like models can be effectively deployed on edge devices, broadening access to multimodal foundation models.
Abstract
Contrastive Language-Image Pre-training (CLIP) has attracted a surge of attention for its superior zero-shot performance and excellent transferability to downstream tasks. However, training such large-scale models usually requires substantial computation and storage, which poses barriers for general users with consumer-level computers. Motivated by this observation, in this paper we investigate how to achieve competitive performance with only one NVIDIA RTX 3090 GPU and one terabyte of storage for the dataset. On the one hand, we simplify the transformer block structure and combine Weight Inheritance with multi-stage Knowledge Distillation (WIKD), thereby reducing the number of parameters and improving inference speed during both training and deployment. On the other hand, to address the convergence challenge posed by a small dataset, we generate synthetic captions for each sample as data augmentation and devise a novel Pair Matching (PM) loss to fully exploit the distinction between positive and negative image-text pairs. Extensive experiments demonstrate that our model achieves a new state-of-the-art data-scale-parameter-accuracy trade-off, which could further popularize the CLIP model in the related research community.
