LMM-Regularized CLIP Embeddings for Image Classification

Maria Tzelepi, Vasileios Mezaris

TL;DR

This work improves image classification with CLIP by injecting knowledge from a Large Multimodal Model (LMM) during training. It prompts MiniGPT-4 to generate per-image semantic descriptions, converts them to mean class descriptions via the frozen CLIP text encoder, and imposes an auxiliary Euclidean alignment loss that pulls CLIP's image embeddings toward these means. The total objective is $J_{total} = J_{ce} + \alpha J_{reg}$ with $J_{reg} = \sum_i \|\mathbf{x}_i - \mathbf{C}_{l}^i\|_2^2$, where $J_{ce}$ is the cross-entropy loss, $\mathbf{x}_i$ is the image embedding of sample $i$, $\mathbf{C}_{l}^i$ is the mean description of its class, and $\alpha$ weights the regularizer. The approach enhances the discriminability of the CLIP image embeddings and achieves state-of-the-art performance on the UCF-101, ERA, and BAR datasets, illustrating the practical value of LMM-guided regularization for multimodal vision tasks. Overall, the method demonstrates that sample-specific, LMM-derived semantic supervision can be effectively integrated into CLIP training to boost downstream classification accuracy.
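The combined objective can be illustrated with a small NumPy sketch. This is only a toy illustration under stated assumptions, not the authors' implementation: the function names, the averaging of the cross-entropy over the batch, and the value of `alpha` are all hypothetical, and the summary does not say whether the embeddings are normalized before the distance is taken.

```python
import numpy as np


def cross_entropy(logits, labels):
    """J_ce: cross-entropy over a batch (batch averaging is an assumption)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


def regularization(embeddings, labels, class_means):
    """J_reg: summed squared Euclidean distance of each image embedding
    to the mean description of its own class."""
    diffs = embeddings - class_means[labels]  # pick one class mean per sample
    return float((diffs ** 2).sum())


def total_loss(logits, embeddings, labels, class_means, alpha):
    """J_total = J_ce + alpha * J_reg."""
    return cross_entropy(logits, labels) + alpha * regularization(
        embeddings, labels, class_means
    )


# Toy data: 2 classes, 2-dimensional embeddings (hypothetical values).
class_means = np.array([[1.0, 0.0], [0.0, 1.0]])
embeddings = np.array([[1.0, 0.0], [0.0, 0.0]])  # first sample sits on its mean
labels = np.array([0, 1])
logits = np.zeros((2, 2))  # uninformative classifier output

loss = total_loss(logits, embeddings, labels, class_means, alpha=0.5)
```

In this toy case only the second sample contributes to the regularizer, so larger values of `alpha` penalize its distance from the class-1 mean more strongly relative to the classification term.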

Abstract

In this paper we address image classification using the CLIP vision-language model. Our goal is to improve the classification performance of CLIP's image encoder by proposing a novel regularization method based on a Large Multimodal Model (LMM). The proposed method uses an LMM to extract semantic descriptions for the images of the dataset. It then uses CLIP's frozen text encoder to obtain the corresponding text embeddings and compute the mean semantic class descriptions. Subsequently, we adapt CLIP's image encoder by adding a classification head, and we train the head together with the image encoder using, apart from the main classification objective, an additional auxiliary objective. The additional objective forces the embeddings at the image encoder's output to become similar to their corresponding LMM-generated mean semantic class descriptions. In this way, it produces embeddings with enhanced discrimination ability, leading to improved classification performance. The effectiveness of the proposed regularization method is validated through extensive experiments on three image classification datasets.
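The step that turns per-image text embeddings into mean semantic class descriptions can be sketched as a per-class average. The sketch below assumes the text embeddings have already been produced by the frozen text encoder and are available as an array; the function name `mean_class_descriptions`, the toy values, and the choice to leave a class without samples at zero are illustrative assumptions rather than details from the paper.

```python
import numpy as np


def mean_class_descriptions(text_embeddings, labels, num_classes):
    """Average the text embeddings of all images that share a class label.

    text_embeddings: array of shape (num_images, dim), one row per
        LMM-generated description (assumed precomputed).
    labels: integer class label of each image, shape (num_images,).
    Returns an array of shape (num_classes, dim).
    """
    means = np.zeros((num_classes, text_embeddings.shape[1]))
    for c in range(num_classes):
        members = text_embeddings[labels == c]
        if len(members) > 0:  # classes with no samples stay at zero (assumption)
            means[c] = members.mean(axis=0)
    return means


# Toy data: three descriptions, two classes (hypothetical values).
text_embeddings = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
class_means = mean_class_descriptions(text_embeddings, labels, num_classes=2)
```

Because the text encoder is frozen, these means can be computed once before training and reused as fixed targets for the auxiliary objective.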

Paper Structure

This paper contains 8 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Proposed method: First, we use MiniGPT-4 to extract semantic descriptions for each image of the dataset. Then, we use CLIP's text encoder (frozen) to extract the corresponding text embeddings. Subsequently, we compute the mean semantic class descriptions. Finally, we modify CLIP's image encoder by attaching a fully connected layer at the output of the encoder, and we train the modified model with the class labels using the cross-entropy loss, along with an additional auxiliary objective that forces the image embeddings at the penultimate layer of the modified model to become similar to their corresponding mean class description.
  • Figure 2: Test accuracy throughout the training epochs for the proposed method against the baseline.