Impact of Language Guidance: A Reproducibility Study

Cherish Puniani, Advika Sinha, Shree Singhi, Aayan Yadav

TL;DR

This work rigorously evaluates language-guided sampling for contrastive self-supervised learning by reproducing Banani et al.'s setup, identifying low-quality RedCaps captions, and substituting higher-quality BLIP-2 captions with an ITM filter. It demonstrates that caption quality and embedding size critically influence gains, and that language-guided models are prone to early overfitting without proper stopping criteria. A saliency-based metric is introduced to assess the semantic capabilities of SSL models, though results show limited improvements across backbones and datasets. The study highlights the conditional benefits of language guidance and emphasizes dataset curation and backbone-aware training design for robust performance gains.

Abstract

Modern deep-learning architectures need large amounts of data to produce state-of-the-art results. Annotating such huge datasets is time-consuming, expensive, and prone to human error. Recent advances in self-supervised learning allow us to train huge models without explicit annotation. Contrastive learning is a popular paradigm in self-supervised learning. Recent works like SimCLR and CLIP rely, respectively, on image augmentations or on directly minimizing a cross-modal loss between image and text. Banani et al. (2023) propose to use language guidance to sample view pairs. They claim that language enables better conceptual similarity, eliminating the effects of visual variability. We reproduce their experiments to verify their claims and find that their dataset, RedCaps, contains low-quality captions. We use an off-the-shelf image captioning model, BLIP-2, to replace the captions and improve performance. We also devise a new metric, based on interpretability methods, to evaluate the semantic capabilities of self-supervised models.
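To make the idea of language-guided view-pair sampling concrete, here is a minimal sketch. It assumes that each image has a caption embedding and that the positive partner of an image is the image whose caption is the nearest neighbour in that embedding space under cosine similarity. The function names and the toy embedding values are hypothetical and purely illustrative; this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def language_guided_pairs(caption_embeddings):
    """For each image index i, return (i, j) where caption j is the
    nearest neighbour of caption i (j != i) by cosine similarity.

    Assumption: pairs are picked by a brute-force nearest-neighbour search
    over caption embeddings; a real pipeline would use an approximate index.
    """
    pairs = []
    for i, u in enumerate(caption_embeddings):
        best_j, best_sim = None, -math.inf
        for j, v in enumerate(caption_embeddings):
            if j == i:
                continue  # an image is never paired with itself
            sim = cosine(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        pairs.append((i, best_j))
    return pairs


# Toy, hand-made caption embeddings (hypothetical values).
toy_embeddings = [
    [1.0, 0.0, 0.1],  # e.g. "a dog on grass"
    [0.9, 0.1, 0.0],  # e.g. "a puppy in a park"
    [0.0, 1.0, 0.0],  # e.g. "a red sports car"
    [0.1, 0.9, 0.1],  # e.g. "a car parked outside"
]
pairs = language_guided_pairs(toy_embeddings)
```

In this sketch the two dog captions pair with each other, as do the two car captions, so the contrastive objective would treat visually different but conceptually similar images as positives. It also shows why caption quality matters: a noisy caption yields a noisy embedding and therefore a poorly matched pair.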

Paper Structure

This paper contains 14 sections, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Refining Representation Learning. While early contrastive learning methods relied on simple image transformations, newer structured retrieval techniques have emerged to refine the learned embeddings beyond instance-level. These employ clustering, memory banks, or language-driven sampling to introduce structure into training signals and subsequently improve visual representation learning.
  • Figure 2: Improving Captions via Contrastive Filtering. Our caption improvement method leverages BLIP-2 to generate candidate captions that better describe an image. Since dataset-provided captions can be irrelevant or inaccurate, we introduce an Image-Text Matching (ITM) filter that evaluates each caption, assigns it a relevance score, and selects the more appropriate of the two. This ensures better semantic and visual alignment between a caption and its corresponding image.
  • Figure 3: A schematic comparison of SimSiam and Language Guided SimSiam trained using our pipeline.
  • Figure 4: Visualisations. Different models perform better on different classes. Top: LGSimCLR (Ours) performs the best. Middle: SimCLR performs the best. Bottom: LGSimCLR performs the best.
  • Figure 5: Overfitting of LGSimSiam. We visualize the advantage of early stopping by comparing the performance of LGSimSiam at the $25^{th}$ and $6^{th}$ epochs. Left: Banani et al. (2023). Right: Ours.
  • ...and 1 more figure
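The caption-selection step described for Figure 2 can be sketched as follows. This is a minimal illustration, assuming an image-text matching scorer is available as a callable that returns a higher value for a better match. The names `select_caption` and `toy_itm_score` are hypothetical, and the toy scorer (word overlap with a set of image tags) is only a stand-in for a real ITM model such as the one used with BLIP-2.

```python
def select_caption(image, original_caption, generated_caption, itm_score):
    """Return (caption, score) for whichever caption the ITM scorer prefers.

    `itm_score(image, caption)` is an assumed interface: any callable that
    returns a relevance score, higher meaning a better image-text match.
    Ties keep the original caption (a design choice made for this sketch).
    """
    s_orig = itm_score(image, original_caption)
    s_gen = itm_score(image, generated_caption)
    if s_gen > s_orig:
        return generated_caption, s_gen
    return original_caption, s_orig


def toy_itm_score(image_tags, caption):
    """Stand-in scorer: fraction of caption words that appear in the image tags.

    A real pipeline would replace this with a learned ITM head.
    """
    words = caption.lower().split()
    if not words:
        return 0.0
    return sum(w in image_tags for w in words) / len(words)


# The "image" is represented by a hypothetical set of content tags.
image_tags = {"dog", "grass", "ball"}
chosen, score = select_caption(
    image_tags,
    "my good boy today",           # dataset-style caption, weakly descriptive
    "a dog with a ball on grass",  # generated-style caption
    toy_itm_score,
)
```

Here the generated caption is selected because it mentions what is visible in the image, while the original caption is kept whenever it scores at least as well, so accurate dataset captions are not discarded.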