MM-TS: Multi-Modal Temperature and Margin Schedules for Contrastive Learning with Long-Tail Data

Siarhei Sheludzko, Dhimitrios Duka, Bernt Schiele, Hilde Kuehne, Anna Kukleva

TL;DR

This work proposes Multi-Modal Temperature and Margin Schedules (MM-TS), and demonstrates that temperature scheduling can be effectively integrated within a max-margin framework, thereby unifying the two predominant approaches in multi-modal contrastive learning: InfoNCE loss and max-margin objective.

Abstract

Contrastive learning has become a fundamental approach in both uni-modal and multi-modal frameworks. This learning paradigm pulls positive pairs of samples closer while pushing negatives apart. In the uni-modal setting (e.g., image-based learning), previous research has shown that the strength of these forces can be controlled through the temperature parameter. In this work, we propose Multi-Modal Temperature and Margin Schedules (MM-TS), extending the concept of uni-modal temperature scheduling to multi-modal contrastive learning. Our method dynamically adjusts the temperature in the contrastive loss during training, modulating the attraction and repulsion forces in the multi-modal setting. Additionally, recognizing that standard multi-modal datasets often follow imbalanced, long-tail distributions, we adapt the temperature based on the local distribution of each training sample. Specifically, samples from dense clusters are assigned a higher temperature to better preserve their semantic structure. Furthermore, we demonstrate that temperature scheduling can be effectively integrated within a max-margin framework, thereby unifying the two predominant approaches in multi-modal contrastive learning: InfoNCE loss and max-margin objective. We evaluate our approach on four widely used image- and video-language datasets, Flickr30K, MSCOCO, EPIC-KITCHENS-100, and YouCook2, and show that our dynamic temperature and margin schedules improve performance and lead to new state-of-the-art results in the field.
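
To make the role of the temperature concrete, the following minimal sketch computes the relative softmax weight that each negative receives for a given temperature. The similarity values and the function name `negative_weights` are illustrative assumptions, not taken from the paper.

```python
import math


def negative_weights(similarities, temperature):
    """Relative softmax weight of each negative: exp(s/tau) / sum_j exp(s_j/tau)."""
    scaled = [s / temperature for s in similarities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities between an anchor and three negatives,
# ordered from hard (very similar) to easy (dissimilar).
negs = [0.9, 0.5, 0.1]

low = negative_weights(negs, temperature=0.05)   # small temperature
high = negative_weights(negs, temperature=1.0)   # large temperature

print("tau=0.05:", [round(w, 4) for w in low])
print("tau=1.00:", [round(w, 4) for w in high])
```

With the small temperature, almost all of the weight falls on the hardest negative. With the large temperature, the weight is spread more evenly, so easy negatives contribute as well.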

Paper Structure (19 sections, 12 equations, 11 figures, 12 tables)

This paper contains 19 sections, 12 equations, 11 figures, 12 tables.

Figures (11)

  • Figure 1: The individual impact of negatives on the loss depends on the temperature parameter. With a small temperature, only hard negatives affect the loss, whereas the impact of easy negatives is negligible. With a large temperature, the impact of easy negatives on the loss increases.
  • Figure 2: Visualization of our MM-TS approach for computing the InfoNCE loss. First, we cluster text annotations to estimate the distribution of the training data. Each cluster is assigned a cluster-based shift $sh(c)$, with larger clusters receiving larger shifts. The base temperature $\tau_{base}$ follows a cosine schedule. Next, we adjust the temperature for each individual sample based on the estimated cluster-based shifts, resulting in $\tau_i$. The updated individual temperature $\tau_i$ is used in a standard InfoNCE loss.
  • Figure 3: Detailed visualization of the temperature calculation for every cluster $c$ at every training iteration $t$, based on the cluster distribution. Given the cosine schedule for the base temperature $\tau_{base}$, the amplitude $\alpha$, the oscillation period $T$, and the cluster-based shifts $sh(c)$, we calculate the temperature for each cluster.
  • Figure 4: Visualization of the annotation embeddings in the YouCook2 dataset using t-SNE. Each point represents a video annotation, and colors indicate the assigned clusters. The number of clusters is 200.
  • Figure 5: Visualization of the long-tail annotation distribution of the YouCook2 dataset. The distribution is computed by k-means clustering (200 clusters) of the annotation embeddings, which are generated with the SentenceBERT model (Reimers and Gurevych, 2019).
  • ...and 6 more figures
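
The pipeline described in the captions of Figures 2 and 3 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the exact form of the cosine schedule, the linear size-to-shift rule in `cluster_shifts`, and all numeric values are hypothetical, and only the overall structure (cosine base temperature, plus a cluster-based shift that grows with cluster size, fed into InfoNCE) follows the captions.

```python
import math


def base_temperature(step, tau_mid, alpha, period):
    """Cosine schedule around tau_mid with amplitude alpha (assumed form)."""
    return tau_mid + alpha * math.cos(2.0 * math.pi * step / period)


def cluster_shifts(cluster_sizes, max_shift):
    """Hypothetical rule: map cluster size linearly to [-max_shift, +max_shift],
    so that larger clusters receive larger shifts."""
    lo, hi = min(cluster_sizes), max(cluster_sizes)
    if hi == lo:
        return [0.0] * len(cluster_sizes)
    return [max_shift * (2.0 * (n - lo) / (hi - lo) - 1.0) for n in cluster_sizes]


def info_nce(sim_row, pos_index, tau):
    """InfoNCE loss for one anchor: -log softmax(sim / tau)[pos_index]."""
    scaled = [s / tau for s in sim_row]
    m = max(scaled)  # log-sum-exp with max subtraction for stability
    log_denom = m + math.log(sum(math.exp(x - m) for x in scaled))
    return log_denom - scaled[pos_index]


# Toy long-tail setup: three clusters of text annotations (sizes are made up).
sizes = [500, 50, 5]
shifts = cluster_shifts(sizes, max_shift=0.02)

# Two training samples: one from the largest cluster, one from the smallest.
sample_clusters = [0, 2]
# Hypothetical image-to-text similarity matrix; positives are on the diagonal.
sims = [[0.8, 0.3],
        [0.2, 0.7]]

step = 0
tau_base = base_temperature(step, tau_mid=0.07, alpha=0.03, period=1000)
taus = [tau_base + shifts[c] for c in sample_clusters]  # per-sample tau_i
losses = [info_nce(sims[i], i, taus[i]) for i in range(len(sims))]

print("per-sample temperatures:", [round(t, 4) for t in taus])
print("per-sample losses:", [round(l, 4) for l in losses])
```

In this sketch the sample from the dense cluster ends up with a higher temperature than the sample from the tail cluster at every step, since both share the same base schedule and differ only in their shift.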