Grounding Descriptions in Images informs Zero-Shot Visual Recognition

Shaunak Halbe, Junjiao Tian, K J Joseph, James Seale Smith, Katherine Stevo, Vineeth N Balasubramanian, Zsolt Kira

TL;DR

GRAIN tackles the limitations of open-vocabulary vision-language models by jointly grounding region-level descriptions and aligning global captions with images. It leverages synthetic region descriptions from a Multimodal LLM and region localizations from an open-vocabulary detector to create weak, scalable supervision, trained with a DETR-inspired architecture and three losses: $L_{ic}$, $L_{box}$, and $L_{rd}$ where $L_{total} = L_{ic} + L_{box} + L_{rd}$. The approach yields substantial gains over CLIP across 11 datasets for zero-shot classification and cross-modal retrieval, and demonstrates strong performance on novel concepts via the Products-2023 benchmark. By providing rich region-text correspondences and enabling test-time descriptions, GRAIN offers a practical path toward more fine-grained and versatile open-vocabulary recognition with relatively compact models.

Abstract

Vision-language models (VLMs) like CLIP have been cherished for their ability to perform zero-shot visual recognition on open-vocabulary concepts. This is achieved by selecting the object category whose textual representation bears the highest similarity with the query image. While successful in some domains, this method struggles with identifying fine-grained entities as well as generalizing to unseen concepts that are not captured by the training distribution. Recent works attempt to mitigate these challenges by integrating category descriptions at test time, albeit yielding modest improvements. We attribute these limited gains to a fundamental misalignment between image and description representations, which is rooted in the pretraining structure of CLIP. In this paper, we propose GRAIN, a new pretraining strategy aimed at aligning representations at both fine and coarse levels simultaneously. Our approach learns to jointly ground textual descriptions in image regions along with aligning overarching captions with global image representations. To drive this pretraining, we leverage frozen Multimodal Large Language Models (MLLMs) to derive large-scale synthetic annotations. We demonstrate the enhanced zero-shot performance of our model compared to current state-of-the-art methods across 11 diverse image classification datasets. Additionally, we introduce Products-2023, a newly curated, manually labeled dataset featuring novel concepts, and showcase our model's ability to recognize these concepts by benchmarking on it. Significant improvements achieved by our model on other downstream tasks like retrieval further highlight the superior quality of representations learned by our approach. Code available at https://github.com/shaunak27/grain-clip.
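
The zero-shot rule described in the abstract (pick the category whose text representation is most similar to the query image) can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and uses hand-made 3-dimensional vectors in place of real encoder outputs; the function names `l2_normalize` and `zero_shot_predict` are hypothetical and are not taken from the GRAIN codebase.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_predict(image_emb, class_text_embs):
    """Return (index of the most similar class, cosine similarity per class).

    image_emb: shape (d,), one image embedding.
    class_text_embs: shape (num_classes, d), one text embedding per class.
    """
    image_emb = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    class_text_embs = l2_normalize(np.asarray(class_text_embs, dtype=np.float64))
    sims = class_text_embs @ image_emb  # shape (num_classes,)
    return int(np.argmax(sims)), sims


# Toy embeddings (assumption: stand-ins for real image/text encoder outputs).
class_names = ["cat", "dog", "car"]
class_text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.2, 0.9, 0.1])

pred, sims = zero_shot_predict(image_emb, class_text_embs)
print(class_names[pred])  # prints "dog"
```

In a real pipeline the two arrays would come from the image and text encoders of the VLM; only the argmax-over-similarity step is shown here.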

Paper Structure

This paper contains 25 sections, 1 equation, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Overview of our two-stage annotation process: (1) prompting LLaVA for image descriptions and (2) acquiring corresponding region annotations from OWLv2.
  • Figure 2: Architecture overview. Our method, GRAIN, aligns image representations to text captions at a global level while localizing salient image regions and aligning them to text descriptions at the local level.
  • Figure 3: Contrastively align predicted regions with descriptions.
  • Figure 4: For zero-shot image classification, the image output embedding is compared with text embeddings of classnames enriched with descriptions.
  • Figure A: Attention maps show more effective object localization by our model compared to CLIP.
  • ...and 5 more figures
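
The Figure 4 caption describes comparing the image output embedding with text embeddings of classnames enriched with descriptions. A rough sketch of that comparison follows. Several parts are assumptions made only for illustration: the prompt template "{classname}, which has {description}", the averaging of prompt embeddings per class, and the bag-of-words `toy_text_encoder`, which is a hypothetical stand-in for a real text encoder. The "image embedding" is also produced by the toy encoder so that the example is self-contained.

```python
import numpy as np

DIM = 32
_vocab = {}  # word -> index, grown on first use (toy stand-in only)


def toy_text_encoder(text):
    """Hypothetical stand-in for a text encoder: unit-norm bag-of-words vector."""
    vec = np.zeros(DIM)
    for word in text.lower().replace(",", " ").split():
        idx = _vocab.setdefault(word, len(_vocab))
        if idx >= DIM:
            raise ValueError("toy vocabulary is full; increase DIM")
        vec[idx] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-12)


def enriched_class_embedding(classname, descriptions):
    """Average the embeddings of description-enriched prompts for one class.

    Assumption: the prompt template and the averaging step are illustrative
    choices, not confirmed details of GRAIN.
    """
    prompts = [f"{classname}, which has {d}" for d in descriptions] or [classname]
    embs = np.stack([toy_text_encoder(p) for p in prompts])
    mean = embs.mean(axis=0)
    return mean / (np.linalg.norm(mean) + 1e-12)


def classify(image_emb, class_to_descriptions):
    """Return the best-matching class name and the cosine similarity per class."""
    names = list(class_to_descriptions)
    class_embs = np.stack(
        [enriched_class_embedding(n, class_to_descriptions[n]) for n in names]
    )
    image_emb = image_emb / (np.linalg.norm(image_emb) + 1e-12)
    sims = class_embs @ image_emb
    return names[int(np.argmax(sims))], dict(zip(names, sims.tolist()))


# Toy inputs: the descriptions are made up for this example.
class_to_descriptions = {
    "tiger": ["striped fur", "whiskers"],
    "zebra": ["striped coat", "hooves"],
}
# Stand-in for an image embedding (a real one would come from the image encoder).
image_emb = toy_text_encoder("striped fur whiskers")

pred, scores = classify(image_emb, class_to_descriptions)
print(pred)  # prints "tiger"
```

Both classes share the "striped" attribute, so the classname alone would not separate them in this toy setup; the additional descriptions ("fur", "whiskers") are what tip the similarity toward "tiger".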