COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training

Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata

TL;DR

COSMOS introduces cross-modality self-distillation for vision-language pre-training by combining a novel text-cropping strategy with a cross-attention module. The approach uses a teacher-student framework with multi-modal augmentations (global and local views of both images and texts), where the cross-modality distillation signal flows through both image and text encoders to achieve fine-grained grounding. On zero-shot retrieval, classification, and semantic segmentation, COSMOS consistently outperforms CLIP-based baselines; on visual perception and contextual understanding tasks, it also surpasses CLIP-based models trained on larger datasets. Data efficiency is evident across CC3M, CC12M, YFCC15M, and Merged-30M. This work advances multi-modal grounding and contextual understanding while reducing data requirements for competitive vision-language models.

Abstract

Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of the contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks. Code is available at https://github.com/ExplainableML/cosmos.
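The text side of the multi-modal augmentations can be illustrated with a minimal sketch. It assumes a caption made of several sentences, a global text view built from 1 to 5 randomly chosen sentences (the range given in the Figure 1 caption), and local views that are single sentences. The single-sentence local view, the number of local views, and the helper names `split_sentences` and `make_text_views` are assumptions for illustration, not taken from the released code.

```python
import random
import re


def split_sentences(caption: str) -> list[str]:
    """Naive sentence splitter (hypothetical helper, not from the paper)."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def make_text_views(caption: str, rng: random.Random,
                    max_global: int = 5, n_local: int = 2):
    """Build one global and several local text views from a caption.

    Assumptions: the global view keeps 1..max_global sentences in their
    original order; each local view is a single sentence.
    """
    sentences = split_sentences(caption)
    if not sentences:
        raise ValueError("caption contains no sentences")

    # Global view: a random subset of 1 to max_global sentences.
    k = rng.randint(1, min(max_global, len(sentences)))
    chosen = sorted(rng.sample(range(len(sentences)), k))
    global_view = " ".join(sentences[i] for i in chosen)

    # Local views: one randomly chosen sentence each (assumed).
    local_views = [rng.choice(sentences) for _ in range(n_local)]
    return global_view, local_views


if __name__ == "__main__":
    text = ("A dog runs on the beach. The sky is blue. "
            "A child holds a red kite. Waves hit the shore.")
    g, locs = make_text_views(text, random.Random(0))
    print("global:", g)
    print("local :", locs)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when pairing text crops with image crops drawn in the same training step.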

Paper Structure

This paper contains 32 sections, 6 equations, 5 figures, 20 tables, 1 algorithm.

Figures (5)

  • Figure 1: We randomly crop the image and randomly select captions (1-5 sentences) to build global and local crops (top). In our cross-attention module, the two modalities are conditioned on each other to create attention maps (normalized to 0-1, bottom).
  • Figure 2: An overview of COSMOS. Left: Our VLM pre-training mechanism is based on the student-teacher framework with contrastive loss ($\mathcal{L}_{\text{CLIP}}$) for multi-modal alignment, and cross-modality self-distillation loss ($\mathcal{L}_{\text{COSMOS}}$) for fine-grained representation learning. Right: The architecture of the student and teacher model with cross-attention modules that extract cross-modality information from the student.
  • Figure 3: Visualization of attention maps in the cross-attention modules. Attention weights are normalized between 0 and 1.
  • Figure 4: Visualization of attention maps. For different sets of captions, we visualize the attention weights of the image and text cross-attention modules. The patch-wise (image) and token-wise (caption) attention weights are both normalized between 0 and 1.
  • Figure 5: Illustration of CLIP with self-supervised approaches. $I$ and $T$ denote the image and text encoders, respectively. $I_t$ (or $T_t$) and $I_s$ (or $T_s$) represent the teacher and student image (or text) encoders, where the teacher is an exponential moving average (EMA) of the student. (a) CLIP (Radford et al., 2021): image and text embeddings are aligned during training. (b) SLIP (Mu et al., 2022): contrastive loss is computed on sets of two different augmentations. (c) SILC (Naeem et al., 2023): self-distillation loss is obtained between local and global crops of the same image. (d) COSMOS: the cross-attention module is utilized to generate cross-modal representations which are optimized through the cross-modality self-distillation loss. We also design global and local crops of image and text pairs for effective self-supervised learning in VLMs.
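
Three mechanics named in the captions above can be sketched in a few lines: the EMA teacher update (Figure 5), a self-distillation loss between teacher and student outputs (Figures 2 and 5), and the 0-1 normalization of attention weights (Figures 1, 3, and 4). This is a rough illustration under stated assumptions, not the authors' implementation. The loss is written in the DINO style (cross-entropy between temperature-scaled softmax outputs), and the momentum and temperature values are placeholders; neither is specified in this section. The 0-1 normalization is assumed to be min-max scaling.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """teacher <- m * teacher + (1 - m) * student, element-wise.

    The momentum value is an assumed placeholder.
    """
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher, student)]


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(teacher_logits, student_logits,
                      t_temp=0.04, s_temp=0.1):
    """Cross-entropy between teacher and student distributions.

    DINO-style formulation (an assumption here); in training, the
    teacher side would be treated as a constant with no gradient.
    """
    p_teacher = [math.exp(v) for v in log_softmax(teacher_logits, t_temp)]
    log_p_student = log_softmax(student_logits, s_temp)
    return -sum(pt * lps for pt, lps in zip(p_teacher, log_p_student))


def min_max_normalize(weights):
    """Rescale attention weights to the range [0, 1] for visualization."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [0.0 for _ in weights]  # flat map: nothing to highlight
    return [(w - lo) / (hi - lo) for w in weights]


if __name__ == "__main__":
    print(ema_update([1.0, 0.0], [0.0, 1.0], momentum=0.9))
    print(distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
    print(min_max_normalize([2.0, 4.0, 6.0]))
```

In a cross-modality setting, the logits passed to `distillation_loss` would come from the cross-attention outputs of the teacher (global views) and the student (local views), rather than from a single-modality encoder alone.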