Enhancing CLIP Conceptual Embedding through Knowledge Distillation
Kuei-Chun Kao
TL;DR
This work presents Knowledge-CLIP, a framework that augments CLIP with knowledge from the large language model Llama 2 through knowledge distillation, soft concept labeling, and contrastive learning. It introduces Text Embedding Distillation to align CLIP's text encoder with Llama 2, Concept Learning to impose soft concept labels obtained by K-means clustering of Llama 2 embeddings, and a standard CLIP-like Contrastive Objective to maintain cross-modal alignment. The approach yields measurable gains in both text- and image-encoder quality: in the text evaluation, Knowledge-CLIP's EM on CC3M falls between that of Llama 2 and that of CLIP, while the image evaluation on AWA2 and CUB shows modest improvements over CLIP. These results indicate that external knowledge can enrich multimodal embeddings and improve downstream understanding, with room for broader evaluation and hyperparameter tuning in future work.
Abstract
CLIP has recently become an important model for aligning images and text in multimodal settings. However, researchers have identified limitations in the ability of CLIP's text and image encoders to extract detailed knowledge from caption-image pairs. In response, this paper presents Knowledge-CLIP, an approach designed to improve CLIP's performance through a new knowledge distillation (KD) method based on Llama 2. Our approach combines three objectives: Text Embedding Distillation, Concept Learning, and Contrastive Learning. First, Text Embedding Distillation trains the Knowledge-CLIP text encoder to mirror the text embeddings of the teacher model, Llama 2. Next, Concept Learning assigns a soft concept label to each caption-image pair by applying offline K-means clustering to Llama 2 text embeddings, enabling Knowledge-CLIP to learn from these soft concept labels. Lastly, Contrastive Learning aligns the text and image embeddings. Our experimental findings show that the proposed model improves the performance of both the text and image encoders.
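To make the three objectives concrete, the following is a minimal NumPy sketch of how they might be combined into one training loss. It is an illustration under stated assumptions, not the implementation used in this work: the mean-squared-error form of the distillation term, the softmax-over-distances construction of the soft concept labels, the temperatures, the loss weights, and all function and parameter names are hypothetical choices made for this example. It also assumes the student text embeddings have already been projected to the teacher's dimension and that K-means centroids have been computed offline.

```python
import numpy as np


def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def distillation_loss(student_text, teacher_text):
    # Assumption: MSE between student and teacher text embeddings
    # (both of shape (batch, dim)).
    return float(np.mean((student_text - teacher_text) ** 2))


def soft_concept_labels(teacher_text, centroids, temperature=1.0):
    # Assumption: soft labels are a softmax over negative squared distances
    # from each teacher embedding to each offline K-means centroid.
    diff = teacher_text[:, None, :] - centroids[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)  # shape (batch, num_concepts)
    return np.exp(log_softmax(-sq_dist / temperature, axis=-1))


def concept_loss(concept_logits, soft_labels):
    # Cross-entropy between predicted concept logits and the soft labels.
    return float(-(soft_labels * log_softmax(concept_logits)).sum(axis=-1).mean())


def contrastive_loss(text_emb, image_emb, temperature=0.07):
    # Symmetric CLIP-style InfoNCE: matching pairs lie on the diagonal.
    logits = l2_normalize(text_emb) @ l2_normalize(image_emb).T / temperature
    idx = np.arange(logits.shape[0])
    text_to_image = -log_softmax(logits, axis=1)[idx, idx].mean()
    image_to_text = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float((text_to_image + image_to_text) / 2)


def total_loss(student_text, teacher_text, image_emb, concept_logits,
               centroids, w_distill=1.0, w_concept=1.0, w_contrast=1.0):
    # Hypothetical weighted sum of the three objectives.
    labels = soft_concept_labels(teacher_text, centroids)
    return (w_distill * distillation_loss(student_text, teacher_text)
            + w_concept * concept_loss(concept_logits, labels)
            + w_contrast * contrastive_loss(student_text, image_emb))


# Toy usage with random arrays standing in for encoder outputs.
rng = np.random.default_rng(0)
batch, dim, num_concepts = 4, 8, 3
teacher = rng.normal(size=(batch, dim))      # stand-in for Llama 2 embeddings
student = teacher + 0.1 * rng.normal(size=(batch, dim))
images = rng.normal(size=(batch, dim))       # stand-in for image embeddings
centers = rng.normal(size=(num_concepts, dim))
logits = rng.normal(size=(batch, num_concepts))
print(total_loss(student, teacher, images, logits, centers))
```

In this sketch the three terms share a single set of text embeddings, so the distillation and concept terms pull the text encoder toward the teacher while the contrastive term keeps it aligned with the image encoder; the relative weights would need tuning in practice.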
