EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

GuangHao Meng, Sunan He, Jinpeng Wang, Tao Dai, Letian Zhang, Jieming Zhu, Qing Li, Gang Wang, Rui Zhang, Yong Jiang

TL;DR

EvdCLIP tackles vision-language retrieval by injecting entity-centric visual knowledge into queries through LLM-generated Entity Visual Descriptions (EVDs). It introduces a trainable EVD-aware Rewriter (EaRW) to fuse descriptions into queries while mitigating noise via a dedicated training regime that includes a preferential ranking objective. The approach builds an offline EVD knowledge base from large-scale data, then uses EaRW to generate high-quality EVD-enhanced queries for robust cross-modal alignment within a dual-encoder CLIP framework. Empirical results across Flickr30K, MSCOCO, Huawei’s Chinese dataset, and other benchmarks demonstrate consistent gains over strong baselines and descriptor methods, with notable improvements in precision-oriented metrics and transferability. The work also highlights model editability and bias reduction through controllable EVD injection, suggesting practical benefits for real-world, domain-adaptive VLR systems.

Abstract

Vision-language retrieval (VLR), which uses text (or images) as queries to retrieve the corresponding images (or text), has attracted significant attention in both academia and industry. However, existing methods often neglect the rich visual semantic knowledge of entities, leading to incorrect retrieval results. To address this problem, we propose the Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues to complement textual data. These EVDs are then integrated into raw queries to create visually rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a novel, trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW uses EVD knowledge and the generative capabilities of the language model to rewrite queries effectively. With our specialized training strategy, EaRW can generate high-quality, low-noise EVD-enhanced queries. Extensive quantitative and qualitative experiments on image-text retrieval benchmarks validate the superiority of EvdCLIP on vision-language retrieval tasks.
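
To make the idea of an "EVD-enhanced query" concrete, the following is a minimal sketch of the naive, template-based integration that the figure captions below describe as low quality (joining entities and descriptions with "which has/is"). It is not the authors' EaRW rewriter. The knowledge base contents, the function name `template_enhance`, and the `max_desc` parameter are all assumptions made for illustration.

```python
# Hypothetical offline EVD knowledge base: entity -> visual descriptions.
# Entries are illustrative only; "four wheels" echoes the stroller example
# used in the paper's Figure 2 caption.
EVD_KB = {
    "stroller": ["four wheels", "a handle for pushing"],
    "tent": ["a fabric canopy", "supporting poles"],
}


def template_enhance(query: str, kb: dict[str, list[str]], max_desc: int = 2) -> str:
    """Append a 'which has ...' clause for every known entity found in the query.

    This is the naive template baseline, not a learned rewriter: it adds
    every selected description whether or not it matches the actual image.
    """
    # Crude whole-word matching; a real system would use an entity extractor.
    words = query.lower().replace(",", " ").replace(".", " ").split()
    clauses = []
    for entity, descriptions in kb.items():
        if entity in words:
            chosen = " and ".join(descriptions[:max_desc])
            clauses.append(f"{entity} which has {chosen}")
    if not clauses:
        return query  # no known entity: leave the query untouched
    return f"{query.rstrip('.')}, {'; '.join(clauses)}"


if __name__ == "__main__":
    print(template_enhance("A woman pushes a stroller", EVD_KB))
```

The output repeats the entity and attaches every selected description unconditionally, which illustrates the noise and fluency problems that motivate a trainable rewriter in place of fixed templates.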

Paper Structure

This paper contains 48 sections, 7 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Illustration of the entity visual description (EVD) enhanced framework. CLIP and WordNetCLIP, which introduces the concept of entities, struggle to distinguish between "camping of tents" and "village", leading to incorrect retrieval results. Our EvdCLIP leverages the EVDs generated by LLMs to improve cross-modal retrieval performance.
  • Figure 2: Challenges of integrating EVDs into VLR. (a) Noise issue. Certain descriptions (e.g., "four wheels") may not be present for the "stroller" in the image, and the query helps to reveal the entity's preferences. (b) Low-quality issue. Using the templates "which has/is" to concatenate entities and descriptions can compromise fluency and introduce ambiguity.
  • Figure 3: The overall architecture of EvdCLIP comprises two components: EVD offline generation via LLMs and EVD-enhanced vision-language retrieval. First, EVD knowledge is generated offline using LLMs. Then, an EVD-aware query rewriter integrates the query with EVD to produce an EVD-enhanced query for retrieval.
  • Figure 4: EvdCLIP focuses on significant regions of the image that are semantically related to the entity. Visualization examples of image-to-text retrieval are provided. We present image queries (the first column) along with four heatmaps.
  • Figure 5: Comparison between EvdCLIP and DesCLIP. The second column represents the image query. The first column shows the similarity scores between the ground truth and the image. The text in red annotates the errors.
  • ...and 8 more figures
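
The retrieval stage named in the Figure 3 caption can be sketched, under strong simplifying assumptions, as ranking gallery items by cosine similarity between a query embedding and item embeddings, which is the usual scoring rule in a dual-encoder setup. The vectors below are hand-written toy values standing in for encoder outputs; the ids `img_tents` and `img_village` and the function names are hypothetical, and no real CLIP model is loaded.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank(query_vec, gallery):
    """Return gallery ids sorted by descending cosine similarity to the query."""
    scored = [(cosine(query_vec, vec), item_id) for item_id, vec in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored]


# Toy embeddings standing in for image-encoder outputs (assumed values).
gallery = {
    "img_tents": [0.9, 0.1, 0.0],
    "img_village": [0.1, 0.9, 0.0],
}
# Toy embedding standing in for the text-encoder output of an enhanced query.
query = [0.8, 0.2, 0.0]

if __name__ == "__main__":
    print(rank(query, gallery))
```

In this picture, enriching the query text changes only the query embedding; the gallery embeddings and the scoring rule stay the same.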