Variance-Aware Loss Scheduling for Multimodal Alignment in Low-Data Settings

Sneh Pillai

TL;DR

This work tackles image-text alignment in low-data regimes by introducing variance-aware loss scheduling, which dynamically weights the two directions of a symmetric contrastive loss (image-to-text and text-to-image) according to the observed variability of the model’s similarity scores. The approach is evaluated on a subset of Flickr8k and compared against a fixed-weight baseline and two other adaptive strategies, one based on output entropy and one on cosine-similarity spread; it improves retrieval accuracy and yields more distinct multimodal embeddings. Key contributions include a principled, low-overhead weighting scheme using EMA-smoothed variances, an empirical comparison with these baselines, and demonstrated robustness to noisy training data. The results suggest that variance-guided weighting can improve sample efficiency and resilience in multimodal learning when data are scarce, with potential applicability to broader tasks and larger datasets in future work.

Abstract

Training vision-language models for image-text alignment typically requires large datasets to achieve robust performance. In low-data scenarios, standard contrastive learning can struggle to align modalities effectively due to overfitting and unstable training dynamics. In this paper, we propose a variance-aware loss scheduling approach that dynamically adjusts the weighting of the contrastive loss based on the statistical variability (uncertainty) in the model's alignment predictions. Using a subset of the Flickr8k image-caption dataset to simulate limited data conditions, we demonstrate that our approach improves image-text retrieval accuracy compared to a fixed-weight baseline. We also compare against other adaptive weighting strategies (using output entropy and cosine similarity spread) and find that variance-aware scheduling provides the best overall trade-off. Qualitatively, our method yields more distinct multimodal embeddings as shown by t-SNE visualizations. Moreover, in a stress test with noise-injected captions and images, the variance-guided loss proves more robust, maintaining higher recall when random perturbations are introduced. These results highlight the benefit of adaptive loss weighting for multimodal alignment in low-data regimes.
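
To make the idea in the abstract concrete, the following is a minimal NumPy sketch of one way a variance-aware schedule could be wired into a symmetric contrastive loss. It is not the paper's implementation. Several choices here are assumptions that the text above does not specify: the per-direction variance is taken as the mean row-wise (image-to-text) or column-wise (text-to-image) variance of the cosine-similarity matrix, the direction with the larger smoothed variance receives the larger weight (the inverse rule is equally plausible), and the temperature (0.07) and EMA decay (0.9) are illustrative defaults. The names `contrastive_terms`, `VarianceAwareScheduler`, and `weighted_loss` are hypothetical.

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_terms(img, txt, temperature=0.07):
    """Return per-direction losses and similarity variances for one batch.

    img, txt: arrays of shape (N, D); row i of each is a matched pair.
    The temperature value is an assumed default, not taken from the paper.
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sim = img @ txt.T                      # (N, N) cosine similarities
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    # Image-to-text: each row is a distribution over captions.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column is a distribution over images.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    # Assumed variability measure: mean variance of the scores in each direction.
    var_i2t = sim.var(axis=1).mean()
    var_t2i = sim.var(axis=0).mean()
    return (loss_i2t, loss_t2i), (var_i2t, var_t2i)


class VarianceAwareScheduler:
    """Keeps EMA-smoothed variances and turns them into two loss weights."""

    def __init__(self, decay=0.9, eps=1e-8):
        self.decay = decay      # assumed EMA decay
        self.eps = eps          # avoids division by zero
        self.ema_var = None

    def update(self, var_i2t, var_t2i):
        var = np.array([var_i2t, var_t2i], dtype=float)
        if self.ema_var is None:
            self.ema_var = var  # initialise the EMA with the first batch
        else:
            self.ema_var = self.decay * self.ema_var + (1.0 - self.decay) * var
        smoothed = self.ema_var + self.eps
        # Assumed rule: weights proportional to smoothed variance, summing to 1.
        return smoothed / smoothed.sum()


def weighted_loss(img, txt, scheduler):
    (loss_i2t, loss_t2i), (var_i2t, var_t2i) = contrastive_terms(img, txt)
    w_i2t, w_t2i = scheduler.update(var_i2t, var_t2i)
    return w_i2t * loss_i2t + w_t2i * loss_t2i
```

In a training loop, one `VarianceAwareScheduler` instance would be kept across steps and `weighted_loss` called on each batch of embeddings. A fixed-weight baseline corresponds to replacing the returned weights with a constant 0.5 each.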

Paper Structure

This paper contains 17 sections, 2 equations, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Training loss curves comparing clean vs. noisy data scenarios. Adaptive methods such as variance-aware loss scheduling degrade more gracefully than fixed-weight training.
  • Figure 2: Retrieval performance (Recall@5) on clean test set after training on noisy data. Variance-aware loss scheduling retains the highest performance under noise.
  • Figure 3: t-SNE visualization of image and caption embeddings (test set). (a) Fixed-loss baseline: image (▲) and caption (●) embeddings form mixed clusters, and some modality gap is visible (images and texts not perfectly aligned). (b) Variance-aware loss scheduling (ours): embeddings show tighter image-caption grouping and clearer separation between different semantic clusters. Best viewed in color.