CLIP-Map: Structured Matrix Mapping for Parameter-Efficient CLIP Compression

Kangjie Zhang, Wenxuan Huang, Xin Zhou, Boxiang Zhou, Dejia Song, Yuan Xie, Baochang Zhang, Lizhuang Ma, Nemo Chen, Xu Tang, Yao Hu, Shaohui Lin

TL;DR

CLIP-Map tackles the high memory and computation demands of CLIP by replacing traditional select-based pruning with a learnable, Kronecker-factorized mapping that compresses weight blocks while preserving information. A Diagonal Inheritance Initialization stabilizes optimization, enabling efficient end-to-end learning of width-depth mappings, followed by a distillation-based retraining stage to transfer knowledge from the original teacher CLIP. The approach yields competitive or superior zero-shot retrieval and classification performance across varying compression ratios, with notable gains under aggressive compression and reduced training time. This mapping-based, end-to-end pipeline offers a practical route for deploying CLIP-like models on resource-constrained devices. It also demonstrates how structured weight mappings can maintain cross-modal alignment and performance while dramatically reducing parameter counts and compute.

Abstract

Contrastive Language-Image Pre-training (CLIP) has been widely applied to various computer vision tasks, e.g., text-to-image generation, image-text retrieval, and image captioning. However, CLIP suffers from high memory and computation costs, which prohibit its use in resource-limited application scenarios. Existing CLIP compression methods typically reduce the size of pre-trained CLIP weights by selecting a subset of them as inherited weights for further retraining, via mask optimization or weight importance measurement. However, such select-based weight inheritance often compromises feature representation ability, especially under extreme compression. In this paper, we propose a novel mapping-based CLIP compression framework, CLIP-Map. It leverages learnable matrices to map and combine pre-trained weights through Full-Mapping with Kronecker Factorization, aiming to preserve as much information from the original weights as possible. To mitigate the optimization challenges introduced by the learnable mapping, we propose Diagonal Inheritance Initialization, which reduces the distribution shift problem and enables efficient and effective mapping learning. Extensive experimental results demonstrate that the proposed CLIP-Map outperforms select-based frameworks across various compression ratios, with particularly significant gains under high compression settings.
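
To make the idea of a Kronecker-factorized mapping more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a single weight matrix W is compressed as A W B^T, where A and B are the (hypothetical) learnable output-side and input-side mapping matrices, and it assumes that a diagonal initialization means starting A and B as rectangular identity matrices, so that the mapped weight initially equals the sub-block a select-based method would inherit. The helper names diagonal_init and map_weight are illustrative only.

```python
import numpy as np

def diagonal_init(rows, cols):
    """Rectangular identity: ones on the main diagonal, zeros elsewhere.

    Assumed stand-in for a diagonal initialization of a mapping matrix.
    """
    return np.eye(rows, cols)

def map_weight(w, a, b):
    """Compress w (d_out x d_in) to (d_out_small x d_in_small) as A W B^T."""
    return a @ w @ b.T

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))   # original weight: d_out=8, d_in=6
a = diagonal_init(4, 8)           # output-side mapping (would be trained)
b = diagonal_init(3, 6)           # input-side mapping (would be trained)

w_small = map_weight(w, a, b)     # compressed weight, shape (4, 3)

# Equivalent "full" mapping acting on the flattened weight:
# vec(A W B^T) = (B kron A) vec(W), with column-major vec.
full_map = np.kron(b, a)                          # shape (12, 48)
vec_w = w.reshape(-1, order="F")                  # column-major flattening
w_small_full = (full_map @ vec_w).reshape(4, 3, order="F")

print("factorized parameters:", a.size + b.size)  # 32 + 18
print("full-map parameters:  ", full_map.size)    # 12 * 48
```

In this toy setting the two factors hold 50 parameters while the equivalent full mapping matrix holds 576, which illustrates why a factorized form is attractive. At initialization the mapped weight is exactly the top-left block of W; training would then let the off-diagonal entries mix in information from the remaining rows and columns.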

Paper Structure

This paper contains 27 sections, 12 equations, 6 figures, and 12 tables.

Figures (6)

  • Figure 1: Select-based Compression Method and Mapping-based Growth Method.
  • Figure 2: In the mapping-learning stage, we freeze the original model's parameters and train only the mapping parameters. In the retraining stage, we use knowledge distillation to train the student model initialized by the mapping stage.
  • Figure 3: We first perform width compression along both the input and output dimensions of each layer's parameter blocks. We then perform depth compression, linearly combining the compressed parameter blocks into a new layer's parameter block.
  • Figure 4: Evolution of TR@1 on the MSCOCO test set for TinyCLIP and CLIP-Map (with and without distillation during the mapping stage).
  • Figure 5: Changes in a mapping matrix during the CLIP-Map-small mapping stage. Due to the significant scale difference between diagonal and off-diagonal weights, we adopt two separate color scales to enhance the visualization quality.
  • ...and 1 more figure
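
The width-then-depth procedure described in the Figure 3 caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: each layer is reduced to a single weight matrix, the per-layer mappings a and b and the depth-mixing matrix d are hypothetical names, and the initial values (rectangular identities for width, every other layer for depth) are chosen only to make the example easy to check. The original weights are never modified, mirroring the frozen-parameter setup of the mapping stage; only a, b, and d would be trained.

```python
import numpy as np

def compress_stack(weights, a, b, d):
    """Width compression per layer, then depth compression across layers.

    weights: (L, d_out, d_in)        frozen original layer weights
    a:       (L, d_out_small, d_out) output-side width mappings
    b:       (L, d_in_small, d_in)   input-side width mappings
    d:       (L_small, L)            depth-mixing coefficients
    returns: (L_small, d_out_small, d_in_small)
    """
    # Width: narrow[l] = a[l] @ weights[l] @ b[l].T
    narrow = np.einsum("lpo,loi,lqi->lpq", a, weights, b)
    # Depth: each new layer is a linear combination of the narrowed layers.
    return np.einsum("kl,lpq->kpq", d, narrow)

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 6, 6))   # 4 layers, 6x6 each

# Illustrative initial mappings (assumptions, not the paper's values).
a = np.tile(np.eye(3, 6), (4, 1, 1))       # keep the first 3 output dims
b = np.tile(np.eye(3, 6), (4, 1, 1))       # keep the first 3 input dims
d = np.zeros((2, 4))
d[0, 0] = 1.0                              # new layer 0 starts from old layer 0
d[1, 2] = 1.0                              # new layer 1 starts from old layer 2

small = compress_stack(weights, a, b, d)
print(small.shape)                         # (2, 3, 3)
```

With these initial values the compressed model starts as a plain selection of layers 0 and 2, each cropped to its top-left 3x3 block; learning non-zero off-diagonal entries in a, b, and d is what would let the compressed layers draw on the remaining weights.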