KDMCSE: Knowledge Distillation Multimodal Sentence Embeddings with Adaptive Angular margin Contrastive Learning

Cong-Duy Nguyen; Thong Nguyen; Xiaobao Wu; Anh Tuan Luu

TL;DR

KDMCSE distills knowledge from a frozen CLIP teacher into a student language encoder to improve multimodal sentence embeddings and to reduce the effect of noisy negative samples in contrastive learning. It adds AdapACSE, an adaptive angular margin loss that scales the angular margin by $\Delta_{i,j} = |1-\alpha_{i,j}|$, where $\alpha_{i,j} = \frac{m_i^T n_j}{\|m_i\|_2 \|n_j\|_2}$ is the cosine similarity between samples, together with a thresholding scheme that filters noisy negatives. The method uses both text and image modalities, with the student learning from CLIP's soft labels. Experiments on standard STS benchmarks show consistent improvements over prior approaches, reflecting better alignment and uniformity and improved transfer to downstream tasks.
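The margin scale $\Delta_{i,j}$ can be illustrated with a minimal sketch in plain Python. The function names are hypothetical, the vectors stand in for whatever representations $m_i$ and $n_j$ denote in the paper, and both are assumed to be non-zero. The sketch covers only the scale $\Delta_{i,j}$, not the full AdapACSE loss.

```python
import math

def cosine_similarity(m, n):
    """alpha = (m . n) / (||m||_2 * ||n||_2) for two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(m, n))
    norm_m = math.sqrt(sum(a * a for a in m))
    norm_n = math.sqrt(sum(b * b for b in n))
    return dot / (norm_m * norm_n)

def margin_scale(m, n):
    """Delta = |1 - alpha|: 0 when the vectors point the same way, 2 when opposite."""
    return abs(1.0 - cosine_similarity(m, n))

def margin_scale_matrix(M, N):
    """Delta[i][j] for every pair (M[i], N[j]) in a batch."""
    return [[margin_scale(m, n) for n in N] for m in M]

# Orthogonal vectors have alpha = 0, so the scale is 1.
print(margin_scale([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Because $\alpha_{i,j}$ lies in $[-1, 1]$, the scale lies in $[0, 2]$ and grows as the two samples become less similar, which is the behavior that lets more dissimilar pairs receive a larger margin.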

Abstract

Previous work on multimodal sentence embedding has proposed multimodal contrastive learning and achieved promising results. However, by treating the rest of the batch as negative samples without vetting them when forming contrastive pairs, those studies admitted many suspicious and noisy negative examples, which significantly hurt the methods' overall performance. In this work, we propose KDMCSE (Knowledge Distillation Multimodal contrastive learning of Sentence Embeddings), a novel approach that enhances the discrimination and generalizability of multimodal representations. It inherits knowledge from a teacher model to learn the difference between positive and negative instances, and can thereby detect noisy and wrong negative samples before they enter the contrastive objective. Furthermore, to overcome the limited modeling of variation within negative pairs, we introduce a new contrastive objective, AdapACSE (Adaptive Angular Margin Supervised Contrastive Learning for Multimodal sentence embeddings), which enhances discriminative representation by strengthening the margin in angular space while capturing the varying semantics among negatives. Experimental results on widely used Semantic Textual Similarity (STS) benchmarks demonstrate the effectiveness of our approach.
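The idea of screening negatives with a teacher before they reach the contrastive objective can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a negative counts as noisy when its teacher similarity exceeds a threshold, it uses a plain InfoNCE loss in place of AdapACSE, and the function names, threshold, and temperature are all hypothetical.

```python
import math

def filter_negatives(teacher_sim, threshold):
    """keep[i][j] is True for the positive (i == j) and for negatives whose
    teacher similarity is at or below `threshold` (assumed filtering rule)."""
    n = len(teacher_sim)
    return [[i == j or teacher_sim[i][j] <= threshold for j in range(n)]
            for i in range(n)]

def masked_info_nce(student_sim, keep, temperature=0.05):
    """Mean InfoNCE loss over the batch, skipping filtered-out negatives."""
    losses = []
    for i, row in enumerate(student_sim):
        pos = math.exp(row[i] / temperature)
        # The denominator sums over the positive and the retained negatives only.
        denom = sum(math.exp(s / temperature)
                    for j, s in enumerate(row) if keep[i][j])
        losses.append(-math.log(pos / denom))
    return sum(losses) / len(losses)

# Toy batch of 3 sentence-image pairs; the values are made up for illustration.
teacher = [[0.30, 0.29, 0.10],
           [0.10, 0.30, 0.05],
           [0.10, 0.28, 0.30]]
student = [[0.80, 0.70, 0.10],
           [0.20, 0.90, 0.10],
           [0.10, 0.60, 0.70]]
keep = filter_negatives(teacher, threshold=0.25)
loss = masked_info_nce(student, keep)
```

In this toy batch, the second image is scored by the teacher as nearly as close to the first and third sentences as their own images, so those two pairs are dropped from the denominator instead of being pushed apart as negatives.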

Paper Structure (30 sections, 16 equations, 12 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Sentence Representation Learning
Deep Metric Learning Objectives
Visually Grounded Representation Learning
Method
Background: Unsupervised SimCSE and Multimodal Contrastive Learning MCSE
Knowledge Distillation Multimodal Contrastive learning for Sentence Embedding
Adaptive Angular margin Contrastive learning
Experiments Setup
Dataset
Implementation
Language Encoder - Student model
Multimodal encoder - Teacher model
MLP Projection Heads
...and 15 more sections

Figures (12)

Figure 1: Example image-caption pairs in Flickr. Solid lines of the same color refer to the same instance, and a dotted line marks additional information that does not occur in the other caption.
Figure 2: Example image-caption pairs in Flickr. The green caption is a true annotation of the image, while the 4 red captions are randomly picked from the dataset. The scores on the right are the cosine similarities between each caption representation and the image representation extracted from the CLIP model.
Figure 3: The overall architecture of KDMCSE. The upper part is the original SimCSE; the lower part is the multimodal contrastive learning approach with knowledge distillation from the CLIP model.
Figure 4: The overall framework of knowledge distillation with Adaptive Angular margin contrastive learning. The pipeline first calculates the soft-label similarity scores between text and visual representations, then applies threshold filtering to remove the noisy negative pairs, and finally passes the soft-label matrices to our proposed AdapACSE to set the margin adaptively. $s_{\hat{a}}$ is the positive sample for $s_a$; $v_b$ and $v_c$ are its negative counterparts. In particular, the difference in the pair $s_a-v_c$ is more pronounced than in $s_a-v_b$. As a result, the margin (depicted as a dashed line) for $c$ (in orange) is greater than for $b$ (in cyan).
Figure 5: Statistics of the maximum index of true captions when sorting the similarity scores between image and text on the Flickr dataset.
...and 7 more figures
