Improving fine-grained understanding in image-text pre-training

Ioana Bica, Anastasija Ilić, Matthias Bauer, Goker Erdogan, Matko Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu, Jovana Mitrović

TL;DR

SPARC introduces Sparse Fine-grained Contrastive Alignment to jointly learn global and local multimodal representations from image-text data. By sparsely grouping image patches into language-grounded embeddings for each caption token and optimizing a sequence-wise local loss alongside a global contrastive loss, SPARC captures fine-grained details without prohibitive memory costs. Across zero-shot classification, image-text retrieval, object detection, and semantic segmentation, SPARC consistently outperforms strong baselines and improves faithfulness in generated captions. The approach maintains scalable compute, enables better localization, and shows promise for integration into large vision-language models.

Abstract

We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute for each token a language-grouped vision embedding as the weighted average of patches. The token and language-grouped vision embeddings are then contrasted through a fine-grained sequence-wise loss that only depends on individual samples and does not require other batch samples as negatives. This enables more detailed information to be learned in a computationally inexpensive manner. SPARC combines this fine-grained loss with a contrastive loss between global image and text embeddings to learn representations that simultaneously encode global and local information. We thoroughly evaluate our proposed method and show improved performance over competing approaches both on image-level tasks relying on coarse-grained information, e.g. classification, as well as region-level tasks relying on fine-grained information, e.g. retrieval, object detection, and segmentation. Moreover, SPARC improves model faithfulness and captioning in foundational vision-language models.
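The grouping and sequence-wise loss described in the abstract can be sketched roughly as follows for a single image-text pair. This is a minimal NumPy illustration and not the authors' implementation. The function names, the min-max normalization, the `1 / num_patches` sparsity threshold, and the temperature value are assumptions made for this sketch. Padding tokens are not handled.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def alignment_weights(tokens, patches, eps=1e-8):
    """Sparse token-to-patch weights for one pair.

    tokens: (L, d) token embeddings, patches: (P, d) patch embeddings.
    Returns an (L, P) matrix whose rows are non-negative and sum to ~1.
    """
    sim = tokens @ patches.T  # (L, P) token-patch similarities
    # Assumption: rescale each token's similarities to [0, 1] over patches.
    lo = sim.min(axis=1, keepdims=True)
    hi = sim.max(axis=1, keepdims=True)
    sim = (sim - lo) / (hi - lo + eps)
    # Assumption: zero out entries below 1 / P to make the matrix sparse.
    num_patches = patches.shape[0]
    sim = np.where(sim < 1.0 / num_patches, 0.0, sim)
    # Normalize each row so the remaining entries act as averaging weights.
    return sim / (sim.sum(axis=1, keepdims=True) + eps)


def _diag_cross_entropy(logits):
    """Mean cross-entropy with the diagonal entries as positives."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


def fine_grained_loss(tokens, patches, temperature=0.07):
    """Sequence-wise contrastive loss; negatives are other tokens of the same caption."""
    tokens = l2_normalize(tokens)
    patches = l2_normalize(patches)
    weights = alignment_weights(tokens, patches)
    grouped = l2_normalize(weights @ patches)  # (L, d) language-grouped vision embeddings
    logits = grouped @ tokens.T / temperature  # (L, L), no other batch samples involved
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits.T))


rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))    # 5 caption tokens
patches = rng.normal(size=(49, 16))  # 7 x 7 image patches
print(fine_grained_loss(tokens, patches))
```

Because the logits matrix has shape (L, L) per sample rather than spanning all tokens and patches in the batch, the cost of this term grows with the caption length instead of the batch size, which is the property the abstract refers to as computationally inexpensive.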

Paper Structure

This paper contains 39 sections, 9 equations, 6 figures, and 11 tables.

Figures (6)

  • Figure 1: For every text token, SPARC learns a corresponding language-grouped vision embedding as the alignment-weighted combination of patches that are most similar to that token. We calculate a sparse similarity metric between tokens and patches of individual image-text pairs (left) and use it to compute the resulting alignment weights (middle). We contrast the language-grouped vision embeddings with token embeddings in a fine-grained contrastive sequence-wise loss (right).
  • Figure 2: Overall architecture for SPARC. The global alignment loss maximizes the similarity between the global vision and global text embeddings, while minimizing the similarity with the other global embeddings in the batch. To obtain the fine-grained alignment, we compute the similarity between the patch embeddings and the token embeddings and then sparsify and normalize the resulting similarity matrix to obtain alignment weights. These alignment weights are then used to group the patch embeddings. The resulting language-grouped vision embeddings are then contrasted with the token embeddings in a sequence-wise fine-grained alignment loss.
  • Figure 3: Qualitative results for zero-shot segmentation on the Pascal VOC dataset. We illustrate the original image, pixel-level ground-truth labels, and the patch-level segmentation masks obtained from SPARC, GLoRIA, and CLIP.
  • Figure 4: TFLOPS (a) and peak memory (b) used by all methods. Relative increase in TFLOPS (c) and peak memory (d) when comparing SPARC and MGCA to CLIP.
  • Figure 5: SPARC vs CLIP vs Ground Truth for examples where SPARC has higher all-token $\mathcal{K}$-Precision ($\mathcal{K}$-P)
  • ...and 1 more figures
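
The global alignment loss described in the Figure 2 caption can be sketched as a symmetric batch-wise contrastive loss. This is a minimal NumPy illustration under stated assumptions and not the authors' code. The function name, the temperature value, and the equal weighting of the image-to-text and text-to-image directions are assumptions made for this sketch.

```python
import numpy as np


def global_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of global embeddings.

    image_emb, text_emb: (B, d) arrays; row i of each belongs to the same pair.
    Matching pairs are pulled together, all other pairs in the batch pushed apart.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature  # (B, B) pairwise similarities

    def diag_cross_entropy(lg):
        # Cross-entropy with the diagonal (matching pairs) as the targets.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))


rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
# Correctly paired embeddings should score a lower loss than mismatched ones.
print(global_contrastive_loss(emb, emb) < global_contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

Unlike the fine-grained term, this loss uses the other samples in the batch as negatives, so its logits matrix scales with the batch size.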