EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning

Yaxiong Wang, Yujiao Wu, Lianwei Wu, Lechao Cheng, Zhun Zhong, Meng Wang

TL;DR

EntityCLIP tackles entity-centric image-text matching (EITM) by bridging the semantic gap between precise entity queries and visual content. It builds on CLIP, using explanation text generated by a Large Language Model (LLM) as bridging clues and processing it through a Multimodal Attentive Experts (MMAE) module to produce enriched image and text representations. The training objective combines a visual-text contrastive loss with an auxiliary GI-ITM loss, formulated as $\mathcal{L} = \mathcal{L}_{VTC}(V_{cls},T_{cls}) + \eta\mathcal{L}_{GFM} + \lambda\mathcal{L}_{VTC}(V^*,T^*)$. The explanation-driven features are used only during training; inference relies solely on the standard CLIP encoders. Evaluations on N24News, VisualNews, and GoodNews show that EntityCLIP consistently outperforms strong baselines, with notable gains in Recall@1 and robustness across datasets, demonstrating the practical viability of LLM-informed bridging for entity-centric retrieval.
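As a rough illustration of how the three terms in the objective above combine, the following is a minimal sketch, not the authors' implementation. It assumes that $\mathcal{L}_{VTC}$ is a symmetric InfoNCE-style contrastive loss over a batch of matched image-text pairs, and it treats $\mathcal{L}_{GFM}$ as a scalar supplied by the caller, since its form is not given here. The function names, the temperature of 0.07, and the default weights for `eta` and `lam` are all hypothetical choices.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def vtc_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair.

    Assumption: an InfoNCE-style formulation with a fixed temperature.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(t) for t in text_feats]
    n = len(imgs)
    # logits[i][j] = cosine(image_i, text_j) / temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class of row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    image_to_text = cross_entropy_diag(logits)
    text_to_image = cross_entropy_diag([list(col) for col in zip(*logits)])
    return 0.5 * (image_to_text + text_to_image)


def total_loss(v_cls, t_cls, v_star, t_star, gfm_loss, eta=1.0, lam=1.0):
    """L = L_VTC(V_cls, T_cls) + eta * L_GFM + lam * L_VTC(V*, T*).

    gfm_loss is a precomputed scalar; eta and lam defaults are placeholders.
    """
    return (vtc_loss(v_cls, t_cls)
            + eta * gfm_loss
            + lam * vtc_loss(v_star, t_star))
```

In this sketch, only the first term involves features that are also available at inference time, which is consistent with the statement that the explanation-driven features are needed only during training.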

Abstract

Recent advancements in image-text matching have been notable, yet prevailing models predominantly cater to broad queries and struggle to accommodate fine-grained query intentions. In this paper, we work towards Entity-centric Image-Text Matching (EITM), a task in which the text and image involve specific entity-related information. The challenge of this task mainly lies in the larger semantic gap in entity association modeling, compared with the general image-text matching problem. To narrow the huge semantic gap between the entity-centric text and the images, we take the fundamental CLIP as the backbone and devise a multimodal attentive contrastive learning framework to tame CLIP to adapt to the EITM problem, developing a model named EntityCLIP. The key to our multimodal attentive contrastive learning is to generate interpretive explanation text using Large Language Models (LLMs) as the bridge clues. Specifically, we proceed by extracting explanatory text from off-the-shelf LLMs. This explanation text, coupled with the image and text, is then input into our specially crafted Multimodal Attentive Experts (MMAE) module, which effectively integrates explanation texts to narrow the gap between the entity-related text and image in a shared semantic space. Building on the enriched features derived from MMAE, we further design an effective Gated Integrative Image-text Matching (GI-ITM) strategy. The GI-ITM employs an adaptive gating mechanism to aggregate MMAE's features, subsequently applying image-text matching constraints to steer the alignment between the text and the image. Extensive experiments are conducted on three social media news benchmarks, including N24News, VisualNews, and GoodNews. The results show that our method surpasses competing methods by a clear margin.

Paper Structure

This paper contains 12 sections, 12 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: In comparison with general image-text matching (subfigure (a)), Entity-centric Image-text Matching (EITM) requires the model to learn more deeply by understanding and discriminating the specific entities under the general concepts (subfigure (b)), for example, "Queen Elizabeth II" under woman and "statue of Eric Morecambe" under statue. This specificity introduces a substantial semantic gap, presenting a significant challenge for cross-modal retrieval.
  • Figure 2: Illustration of training EntityCLIP. Initially, we harness Large Language Models (LLMs) to generate explanation text based on the entity-text query. This text, along with the query and image, is then encoded to derive representations. These are subsequently processed by the Multimodal Attentive Experts (MMAE) to integrate the query and image features, leveraging the explanation text to bridge semantic disparities. The framework is optimized through contrastive learning, coupled with a Gated Image-text Matching loss to refine the alignment and learning of the network.
  • Figure 3: Explanation text example for an entity-centric query. The explanation text offers visual details about the entity Donald Trump and further describes aspects of the occasion, such as the crowd, thereby narrowing the semantic gap.
  • Figure 4: Visualization of the cross attention in an explanation expert. The top 5 attended words are shown for entities in image patches (words in red) and in the query text (words in blue frames).
  • Figure 5: Illustration of utilizing the trained EntityCLIP without fine-tuning to perform multimodal news classification. The similarities from the image and the headline of the multimodal news are averaged as the final similarity score.
  • ...and 1 more figure
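
The score averaging described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes the image and the headline have already been encoded into embeddings, and that each class is represented by a text embedding (for example, of the class name) against which cosine similarity is computed. The function names and the `class_embs` dictionary are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_news(image_emb, headline_emb, class_embs):
    """Pick the class whose embedding best matches image and headline.

    The final score for each class is the average of the image-to-class
    and headline-to-class similarities.
    """
    scores = {
        label: 0.5 * (cosine(image_emb, emb) + cosine(headline_emb, emb))
        for label, emb in class_embs.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores
```

A call such as `classify_news(image_emb, headline_emb, {"sports": s, "politics": p})` returns the highest-scoring label together with the per-class averaged scores, with no fine-tuning step involved.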