ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions

Honglin Lin, Siyu Li, Guoshun Nan, Chaoyue Tang, Xueting Wang, Jingxin Xu, Rong Yankai, Zhili Zhou, Yutong Gao, Qimei Cui, Xiaofeng Tao

TL;DR

ContextBLIP tackles image retrieval from linguistically complex descriptions with a doubly contextual alignment strategy. It first performs intra-contextual alignment using a multi-scale adapter supervised by an image-text matching loss and a text-guided masking loss, then applies an inter-context Transformer to model dependencies across multiple candidate images. The approach achieves state-of-the-art results on IMAGECODE in both zero-shot and fine-tuned settings, and reaches GPT-4V-level performance with about 7,500 times fewer parameters. The work shows that targeted, context-aware cross-modal supervision can substantially improve fine-grained alignment for IRCD and offers a scalable path toward more capable, lightweight vision-language retrieval systems.

Abstract

Image retrieval from contextual descriptions (IRCD) aims to identify an image within a set of minimally contrastive candidates based on linguistically complex text. Despite their success, vision-language models (VLMs) still lag significantly behind human performance on IRCD. The main challenge lies in aligning key contextual cues across the two modalities: these subtle cues are concealed in tiny areas of multiple contrastive images and within the complex linguistics of the textual descriptions. This motivates us to propose ContextBLIP, a simple yet effective method that relies on a doubly contextual alignment scheme for IRCD. Specifically, 1) our model comprises a multi-scale adapter, a matching loss, and a text-guided masking loss. The adapter learns to capture fine-grained visual cues. The two losses provide iterative supervision for the adapter, gradually highlighting the focal patches of a single image that correspond to the key textual cues. We term this intra-contextual alignment. 2) ContextBLIP then employs an inter-context encoder to learn dependencies among candidates, facilitating alignment between the text and multiple images. We term this inter-contextual alignment. Consequently, the nuanced cues concealed in each modality can be effectively aligned. Experiments on two benchmarks show the superiority of our method. We observe that ContextBLIP yields results comparable to those of GPT-4V despite having about 7,500 times fewer parameters.
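
The two alignment stages described above can be illustrated with a small data-flow sketch. This is not the authors' implementation: the function names (`intra_scores`, `inter_context`, `retrieve`), the toy 2-dimensional embeddings, the dot-product scorer, and the single parameter-free self-attention pass are all assumptions made for illustration. The real model uses a learned BLIP-based encoder and a learned inter-context encoder.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def intra_scores(text_vec, image_vecs):
    """Hypothetical stand-in for stage 1: score each candidate on its own."""
    return [dot(text_vec, v) for v in image_vecs]

def inter_context(image_vecs):
    """Hypothetical stand-in for stage 2: one self-attention pass over the
    candidate set, so each candidate's features depend on the others."""
    d = len(image_vecs[0])
    mixed = []
    for q in image_vecs:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in image_vecs])
        ctx = [sum(w * v[j] for w, v in zip(weights, image_vecs))
               for j in range(d)]
        # Residual connection keeps the candidate's own features.
        mixed.append([a + b for a, b in zip(q, ctx)])
    return mixed

def retrieve(text_vec, image_vecs, use_inter_context=False):
    """Return (index of the best-scoring candidate, all scores)."""
    feats = inter_context(image_vecs) if use_inter_context else image_vecs
    scores = intra_scores(text_vec, feats)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy candidates (assumed values, not data from the paper).
text_vec = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
print(retrieve(text_vec, candidates))
print(retrieve(text_vec, candidates, use_inter_context=True))
```

The sketch only separates the two steps: scoring each image against the text independently, and letting candidates exchange information before scoring. How the learned encoders actually weight the candidates is not specified here.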

Paper Structure (26 sections, 5 equations, 12 figures, 16 tables)

This paper contains 26 sections, 5 equations, 12 figures, 16 tables.

Figures (12)

  • Figure 1: An instance selected from a public IRCD benchmark, involving six very similar contrastive image candidates and the query "Middle girl's hand is blurry and shoulder level, her eyes are almost shut, the girl on the right is looking at the middle girl's hand". The target image is the fourth one, marked with a red rectangular box.
  • Figure 2: (a) Architecture of ContextBLIP, comprising a BLIP-based intra-context encoder, a scorer for image-text matching (ITM, $\mathcal{L}_{itm}$), and a Transformer-based decoder for text-guided masked image modeling (TMIM, $\mathcal{L}_{tmim}$). (b) The multi-scale adapter in the encoder is co-supervised by $\mathcal{L}_{itm}$ and $\mathcal{L}_{tmim}$ on the COCO and VG datasets, while BLIP is kept frozen. (c) The learnable text-guided mask is iteratively updated under this co-supervision. (d) Zero-shot ContextBLIP on the IRCD task. (e) Fine-tuning ContextBLIP for IRCD with the inter-context encoder.
  • Figure 3: (a) Zero-shot: $P_{BLIP}$ and $P_{Ours}$ are the matching scores of BLIP and our model, and $P_{BLIP}'$ and $P_{Ours}'$ are the corresponding scores for the key contextual cue. (b) Fine-tuned: $P_{BLIP}$ and $P_{Ours}$ are the matching scores of BLIP and our model, and $P_{BLIP}'$ and $P_{Ours}'$ are the corresponding scores for the key contextual cue.
  • Figure 4: Case illustration of how we prompt GPT-4V for IRCD. The red boxes mark GPT-4V's response and the yellow box marks our prediction.
  • Figure 5: Zero-shot cases from the test set. Our model outperforms BLIP on both confidence scores.
  • ...and 7 more figures
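
The training pattern in the Figure 2(b) caption, a frozen backbone with an adapter updated under two losses, can be sketched with a toy scalar model. Everything here is an assumption for illustration: the scalar weights, the squared-error stand-ins for $\mathcal{L}_{itm}$ and $\mathcal{L}_{tmim}$, their equal weighting, and the learning rate are hypothetical and are not taken from the paper.

```python
def train_adapter(xs, itm_targets, tmim_targets,
                  frozen_w=1.0, lr=0.05, steps=200):
    """Toy co-supervision: only adapter_w is updated; frozen_w never changes.

    The model output is frozen_w * x + adapter_w * x (a residual adapter).
    Both losses are squared errors here, a stand-in for the real ITM and
    TMIM objectives, and they are summed with equal weight (an assumption).
    """
    adapter_w = 0.0  # the only trainable parameter
    for _ in range(steps):
        grad = 0.0
        for x, t_itm, t_tmim in zip(xs, itm_targets, tmim_targets):
            out = frozen_w * x + adapter_w * x
            # Gradient of (out - t_itm)^2 + (out - t_tmim)^2 w.r.t. adapter_w.
            grad += 2 * (out - t_itm) * x + 2 * (out - t_tmim) * x
        adapter_w -= lr * grad / len(xs)
    return frozen_w, adapter_w

# Toy data (assumed): both stand-in losses want the output to be 1.5 * x.
frozen_w, adapter_w = train_adapter(
    xs=[1.0, 2.0],
    itm_targets=[1.5, 3.0],
    tmim_targets=[1.5, 3.0],
)
print(frozen_w, adapter_w)
```

The point of the sketch is the split of responsibilities: the backbone weight stays fixed while the small adapter absorbs the correction that both objectives ask for.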