WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition

Shan Ning, Longtian Qiu, Jiaxuan Sun, Xuming He

Abstract

Open-domain visual entity recognition (VER) seeks to associate images with entities in encyclopedic knowledge bases such as Wikipedia. Recent generative methods tailored for VER demonstrate strong performance but incur high computational costs, limiting their scalability and practical deployment. In this work, we revisit the contrastive paradigm for VER and introduce WikiCLIP, a simple yet effective framework that establishes a strong and efficient baseline for open-domain VER. WikiCLIP leverages large language model embeddings as knowledge-rich entity representations and enhances them with a Vision-Guided Knowledge Adaptor (VGKA) that aligns textual semantics with visual cues at the patch level. To further encourage fine-grained discrimination, a Hard Negative Synthesis Mechanism generates visually similar but semantically distinct negatives during training. Experimental results on popular open-domain VER benchmarks, such as OVEN, demonstrate that WikiCLIP significantly outperforms strong baselines. Specifically, WikiCLIP achieves a 16% improvement on the challenging OVEN unseen set, while reducing inference latency by nearly 100 times compared with the leading generative model, AutoVER. The project page is available at https://artanic30.github.io/project_pages/WikiCLIP/
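
The efficiency argument for the contrastive paradigm can be illustrated with a small sketch: recognition reduces to a similarity search between one image embedding and a set of entity embeddings, with no autoregressive decoding. The code below is a minimal illustration under stated assumptions, not the WikiCLIP implementation. The function names, the toy three-dimensional vectors, and the use of cosine similarity over a bank of entity embeddings are all assumptions made for this example; in the actual system the embeddings would come from the trained encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_entities(image_embedding, entity_embeddings):
    """Return entity names sorted by cosine similarity to the image, best first."""
    query = l2_normalize(image_embedding)
    scores = {}
    for name, emb in entity_embeddings.items():
        key = l2_normalize(emb)
        scores[name] = sum(q * k for q, k in zip(query, key))
    return sorted(scores, key=scores.get, reverse=True)


# Toy, hand-written vectors (hypothetical); real embeddings would be
# produced by the image encoder and the entity-side model.
entity_bank = {
    "Golden Retriever": [0.9, 0.1, 0.0],
    "Labrador Retriever": [0.7, 0.3, 0.1],
    "Eiffel Tower": [0.0, 0.1, 0.9],
}
ranking = rank_entities([1.0, 0.0, 0.0], entity_bank)
print(ranking[0])
```

In this kind of setup, the per-query cost is one image encoding plus a set of dot products. That is one plausible reading of why a contrastive model can be much faster at inference than a generative one.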

Paper Structure (45 sections, 11 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Performance comparison. We compare WikiCLIP with strong baselines in terms of inference latency and generalization ability, as measured by unseen accuracy. As shown in the figure, WikiCLIP delivers state-of-the-art performance in OVEN unseen accuracy while maintaining low inference latency.
  • Figure 2: The Overall Pipeline of WikiCLIP. Given an entity's Wikipedia document, we use CLIP to extract patch-level features from the entity image and an LLM to obtain embeddings of its encyclopedic text description. The Vision-Guided Knowledge Adaptor (VGKA) selects informative text tokens, guided by the visual features, to produce an entity representation. To further improve fine-grained discrimination, we introduce a hard negative synthesis strategy. This strategy generates challenging negative samples by replacing the original entity text description with that of a visually similar but semantically distinct entity. These synthetic hard negatives encourage the model to focus on subtle semantic differences.
  • Figure 3: Performance with different text lengths. We report the accuracy of WikiCLIP-S on the INFOSEEK validation set. The best performance is achieved at a text length of 256.
  • Figure 4: Performance with varying training iterations and LLM choices. We report the accuracy of WikiCLIP on the INFOSEEK validation set using LLMs of three different scales, along with varying training iterations.
  • Figure 5: Performance with Different Ratios of Seen Entities. We evaluate models trained with varying ratios of seen entities. Seen Acc and Unseen Acc measure accuracy on test samples whose entities were present or absent, respectively, during training.
  • ...and 5 more figures
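
The hard negative synthesis described in the Figure 2 caption (keep the entity's image, swap in the text description of a visually similar but semantically distinct entity) can be sketched as follows. This is an illustrative sketch under assumptions rather than the authors' code. The caption does not say how visual similarity is measured, so cosine similarity over image features is assumed here; the dictionary fields (`name`, `image_feat`, `text`, `text_source`) and the toy feature vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def synthesize_hard_negative(anchor, entities):
    """Pair the anchor's image with the text of its closest visual look-alike.

    Assumption: "visually similar" means highest cosine similarity between
    image features; the source does not specify the similarity measure.
    """
    others = [e for e in entities if e["name"] != anchor["name"]]
    lookalike = max(others, key=lambda e: cosine(anchor["image_feat"], e["image_feat"]))
    return {
        "image_feat": anchor["image_feat"],  # image stays the same
        "text": lookalike["text"],           # text comes from a different entity
        "text_source": lookalike["name"],
    }


# Hypothetical toy entities with hand-written image features.
entities = [
    {"name": "Golden Retriever", "image_feat": [0.9, 0.1, 0.0], "text": "A Scottish gun dog breed."},
    {"name": "Labrador Retriever", "image_feat": [0.8, 0.2, 0.1], "text": "A British retriever breed."},
    {"name": "Eiffel Tower", "image_feat": [0.0, 0.1, 0.9], "text": "A wrought-iron tower in Paris."},
]
negative = synthesize_hard_negative(entities[0], entities)
print(negative["text_source"])
```

Because the swapped-in text belongs to a look-alike entity, a model cannot reject the negative from coarse visual cues alone, which matches the caption's stated goal of focusing on subtle semantic differences.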