LMM-Regularized CLIP Embeddings for Image Classification
Maria Tzelepi, Vasileios Mezaris
TL;DR
This work improves image classification with CLIP by injecting knowledge from a Large Multimodal Model (LMM) during training. It prompts MiniGPT-4 to generate a semantic description for each image, embeds the descriptions with the frozen CLIP text encoder, and averages the embeddings per class to obtain mean class descriptions. An auxiliary Euclidean alignment loss then pulls CLIP's image embeddings toward these means, yielding $J_{total} = J_{ce} + \alpha J_{reg}$ with $J_{reg} = \sum_i \|\mathbf{x}_i - \mathbf{C}_{l}^i\|_2^2$, where $\mathbf{x}_i$ is the image embedding of sample $i$ and $\mathbf{C}_{l}^i$ is the mean description of its class. The approach enhances the discriminability of the CLIP image embeddings and achieves state-of-the-art performance on the UCF-101, ERA, and BAR datasets, showing that LMM-derived semantic supervision can be integrated into CLIP training to boost downstream classification accuracy.
Abstract
In this paper we deal with image classification tasks using the powerful CLIP vision-language model. Our goal is to advance the classification performance obtained with CLIP's image encoder by proposing a novel Large Multimodal Model (LMM) based regularization method. The proposed method uses an LMM to extract semantic descriptions for the images of the dataset. Then, it uses CLIP's text encoder, kept frozen, to obtain the corresponding text embeddings and compute the mean semantic class descriptions. Subsequently, we adapt CLIP's image encoder by adding a classification head, and we train it, along with the image encoder output, with an auxiliary objective in addition to the main classification objective. The auxiliary objective forces the embeddings at the image encoder's output to become similar to their corresponding LMM-generated mean semantic class descriptions. In this way, the method produces embeddings with enhanced discrimination ability, leading to improved classification performance. The effectiveness of the proposed regularization method is validated through extensive experiments on three image classification datasets.
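The two computational steps described above, averaging the text embeddings per class and adding the alignment term to the classification loss, can be illustrated with a minimal sketch. It uses plain Python lists in place of real CLIP embeddings; the function names, the batch-mean reduction for the cross-entropy term, and the assumption that image and text embeddings share one dimensionality are illustrative choices, not details taken from the paper. An actual training loop would compute the same quantities in an automatic differentiation framework.

```python
import math


def mean_class_descriptions(text_embeddings, labels, num_classes):
    """Average the frozen text embeddings of all descriptions in each class."""
    dim = len(text_embeddings[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for emb, label in zip(text_embeddings, labels):
        counts[label] += 1
        for d, value in enumerate(emb):
            sums[label][d] += value
    # Assumption: every class has at least one description.
    return [[s / counts[c] for s in sums[c]] for c in range(num_classes)]


def cross_entropy(logits, label):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def total_loss(image_embeddings, logits, labels, class_means, alpha):
    """J_total = J_ce + alpha * J_reg for one batch."""
    # Assumption: J_ce is averaged over the batch.
    j_ce = sum(cross_entropy(z, y) for z, y in zip(logits, labels)) / len(labels)
    # J_reg: summed squared Euclidean distance between each image embedding
    # and the mean description of its ground-truth class.
    j_reg = sum(
        sum((x_d - c_d) ** 2 for x_d, c_d in zip(x, class_means[y]))
        for x, y in zip(image_embeddings, labels)
    )
    return j_ce + alpha * j_reg
```

The class means are computed once from the frozen text encoder and then treated as fixed targets, so only the image-side parameters would receive gradients from the alignment term; the weight `alpha` trades it off against the classification loss.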
