Impact of Language Guidance: A Reproducibility Study
Cherish Puniani, Advika Sinha, Shree Singhi, Aayan Yadav
TL;DR
This work rigorously evaluates language-guided sampling for contrastive self-supervised learning by reproducing Banani et al.'s setup, identifying low-quality RedCaps captions, and substituting higher-quality BLIP-2 captions with an ITM filter. It demonstrates that caption quality and embedding size critically influence gains, and that language-guided models are prone to early overfitting without proper stopping criteria. A saliency-based metric is introduced to assess the semantic capabilities of SSL models, though results show limited improvements across backbones and datasets. The study highlights the conditional benefits of language guidance and emphasizes dataset curation and backbone-aware training design for robust performance gains.
Abstract
Modern deep-learning architectures need large amounts of data to produce state-of-the-art results. Annotating such large datasets is time-consuming, expensive, and prone to human error. Recent advances in self-supervised learning make it possible to train large models without explicit annotation. Contrastive learning is a popular paradigm within self-supervised learning. Recent works such as SimCLR and CLIP rely, respectively, on image augmentations and on directly minimizing a cross-modal loss between image and text. Banani et al. (2023) instead propose using language guidance to sample view pairs. They claim that language captures conceptual similarity better, eliminating the effects of visual variability. We reproduce their experiments to verify these claims and find that their dataset, RedCaps, contains low-quality captions. We replace these captions with ones generated by an off-the-shelf image captioning model, BLIP-2, which improves performance. We also devise a new metric, based on interpretability methods, to evaluate the semantic capabilities of self-supervised models.
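To make the idea of language-guided sampling concrete, the following is a minimal sketch, assuming that view pairs are formed by pairing each image with the image whose caption embedding is its nearest neighbor under cosine similarity. The function names and the toy two-dimensional embeddings are hypothetical; in practice the embeddings would come from a pretrained text encoder, and this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def language_guided_pairs(caption_embeddings):
    """Pair each image index with the index of its nearest caption neighbor.

    Assumption: caption_embeddings[i] is the text embedding of the caption
    attached to image i. The image itself is excluded from its own search.
    """
    pairs = []
    for i, u in enumerate(caption_embeddings):
        best_j, best_sim = None, -math.inf
        for j, v in enumerate(caption_embeddings):
            if j == i:
                continue
            sim = cosine(u, v)
            if sim > best_sim:
                best_j, best_sim = j, sim
        pairs.append((i, best_j))
    return pairs


# Toy caption embeddings: two captions about one concept, two about another.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pairs = language_guided_pairs(toy)
print(pairs)
```

Each resulting pair (i, j) would then serve as the positive pair in the contrastive loss, in place of two augmentations of the same image. The brute-force search is quadratic in the number of captions, so at dataset scale an approximate nearest-neighbor index would be the more realistic choice.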
