AmCLR: Unified Augmented Learning for Cross-Modal Representations

Ajay Jagannath, Aayush Upadhyay, Anant Mehta

TL;DR

AmCLR and xAmCLR address the large-batch bottleneck in bimodal contrastive learning by extending the memory-efficient SogCLR framework with cross-modal and intra-modal augmentations. AmCLR aligns augmented images and texts across modalities, while xAmCLR additionally incorporates intra-modal losses to enrich unimodal representations; both yield a global objective optimized via a stochastic gradient estimator. Empirically, both losses outperform SogCLR and iSogCLR on text retrieval, image retrieval, and zero-shot tasks with a modest batch size ($|B|=128$) and a 100k-sample training subset of CC3M, using pretrained ResNet-50 and DistilBERT encoders. The results indicate improved robustness and generalization in vision-language pretraining under constrained compute. Directions for future work include scaling, augmentation diversification, and integration with distributionally robust optimization (DRO).

Abstract

Contrastive learning has emerged as a pivotal framework for representation learning, underpinning unimodal methods such as SimCLR and bimodal methods such as CLIP. To address fundamental limitations such as large-batch dependence and bimodality, methods such as SogCLR apply stochastic optimization to the global contrastive objective. Inspired by SogCLR's efficiency and adaptability, we introduce the AmCLR and xAmCLR objective functions, tailored for bimodal vision-language models, to further enhance the robustness of contrastive learning. AmCLR integrates diverse augmentations, including text paraphrasing and image transformations, to reinforce the alignment of contrastive representations. It keeps the batch size to a few hundred samples, unlike CLIP, which needs a batch size of 32,768 to produce reasonable results. xAmCLR extends this paradigm by incorporating intra-modal alignments between original and augmented inputs of the same modality for richer feature learning. These advancements yield a more resilient and generalizable contrastive learning process, aimed at overcoming bottlenecks in scaling and augmentation diversity. Because our framework builds on SogCLR, we are able to demonstrate improved representation quality with fewer computational resources, establishing a foundation for scalable and robust multi-modal learning.
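
To make the distinction between the two objectives concrete, the following is a minimal NumPy sketch of the general idea only. It is not the paper's loss: it uses a plain mini-batch InfoNCE term rather than SogCLR's global-objective estimator, and the choice of pairs, the equal weighting, the temperature `tau`, and the names `cross_modal_loss`, `cross_plus_intra_loss`, and `intra_weight` are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax_rows(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(a, b, tau=0.1):
    """Symmetric mini-batch InfoNCE; row i of `a` and row i of `b` are positives."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / tau
    idx = np.arange(len(a))
    loss_ab = -log_softmax_rows(logits)[idx, idx].mean()
    loss_ba = -log_softmax_rows(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def cross_modal_loss(img, img_aug, txt, txt_aug, tau=0.1):
    # Assumed AmCLR-style term: align every image view with every text view.
    pairs = [(img, txt), (img_aug, txt), (img, txt_aug), (img_aug, txt_aug)]
    return sum(info_nce(a, b, tau) for a, b in pairs) / len(pairs)

def cross_plus_intra_loss(img, img_aug, txt, txt_aug, tau=0.1, intra_weight=1.0):
    # Assumed xAmCLR-style term: add original-vs-augmented alignment
    # within each modality (image-image and text-text).
    intra = 0.5 * (info_nce(img, img_aug, tau) + info_nce(txt, txt_aug, tau))
    return cross_modal_loss(img, img_aug, txt, txt_aug, tau) + intra_weight * intra

# Toy embeddings standing in for encoder outputs (batch of 8, dimension 16).
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 16))
txt = img + 0.01 * rng.normal(size=(8, 16))
img_aug = img + 0.01 * rng.normal(size=(8, 16))
txt_aug = txt + 0.01 * rng.normal(size=(8, 16))
print(cross_modal_loss(img, img_aug, txt, txt_aug))
print(cross_plus_intra_loss(img, img_aug, txt, txt_aug))
```

In this sketch the negatives come only from the current mini-batch, which is exactly the dependence on batch size that SogCLR-style estimators are meant to remove; the paper's objectives replace that per-batch normalization with a stochastic estimate of the global one.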

Paper Structure

This paper contains 15 sections, 72 equations, 2 figures, 3 tables, 2 algorithms.

Figures (2)

  • Figure 1: Comprehensive comparison of retrieval performance across text and image modalities. Both AmCLR and xAmCLR consistently outperform the baseline methods, with AmCLR (AdamP) achieving the highest mean performance of 32.30% on text retrieval and 26.91% on image retrieval. The results demonstrate the effectiveness of our approaches across different optimizers and modalities while maintaining computational efficiency.
  • Figure 2: Zero-shot Top-1 Accuracy Comparison. The plot shows the zero-shot learning capabilities of our approaches, with AmCLR (AdamW) achieving 25.87% accuracy, followed closely by xAmCLR variants. This demonstrates the models' ability to generalize to unseen data without additional training.
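
The captions above report retrieval scores without stating the metric. As an illustration only, the sketch below computes Recall@K from an image-text similarity matrix, a common choice for retrieval benchmarks. Assuming Recall@K here is a guess, as is the simplification that query `i` has exactly one correct candidate at index `i`; the function name `recall_at_k` is hypothetical.

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of queries whose true match is among the top-k candidates.

    sim[i, j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to sit at index i.
    """
    ranked = np.argsort(-sim, axis=1)          # candidates, best first
    topk = ranked[:, :k]
    targets = np.arange(sim.shape[0])[:, None]
    return (topk == targets).any(axis=1).mean()

# Three queries against three candidates.
sim = np.array([
    [0.9, 0.1, 0.0],   # correct candidate ranked first
    [0.2, 0.3, 0.8],   # correct candidate ranked second
    [0.1, 0.7, 0.6],   # correct candidate ranked second
])
r1 = recall_at_k(sim, 1)   # 1 of 3 queries hit
r2 = recall_at_k(sim, 2)   # all 3 queries hit
print(r1, r2)
```

Text retrieval and image retrieval use the same routine with the similarity matrix transposed, so that either images or captions act as the queries.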