
A Generative Approach for Wikipedia-Scale Visual Entity Recognition

Mathilde Caron, Ahmet Iscen, Alireza Fathi, Cordelia Schmid

TL;DR

This paper introduces a novel Generative Entity Recognition (GER) framework which, given an input image, learns to auto-regressively decode a semantic and discriminative “code” identifying the target entity.

Abstract

In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is to use dual-encoder models (e.g., CLIP), where all the entity names and query images are embedded into a unified space, paving the way for an approximate k-NN search. Alternatively, a captioning model can be re-purposed to directly generate the entity name for a given image. In contrast, we introduce a novel Generative Entity Recognition (GER) framework which, given an input image, learns to auto-regressively decode a semantic and discriminative “code” identifying the target entity. Our experiments demonstrate the efficacy of this GER paradigm, showcasing state-of-the-art performance on the challenging OVEN benchmark. GER surpasses strong captioning, dual-encoder, visual matching, and hierarchical classification baselines, affirming its advantage in tackling the complexities of web-scale recognition.
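
To make the generative formulation more concrete, the following is a minimal sketch of decoding an entity code one token at a time. It is an illustration under stated assumptions, not the authors' implementation: `score_fn` is a hypothetical stand-in for the image-conditioned model, the toy codes and scores are invented, and restricting each step to tokens that continue a known entity code is an assumption that the abstract does not state.

```python
from typing import Callable, Dict, Set, Tuple

Code = Tuple[str, ...]


def build_prefix_index(entity_codes: Dict[str, Code]) -> Dict[Code, Set[str]]:
    """Map every code prefix to the set of tokens that may follow it."""
    allowed: Dict[Code, Set[str]] = {}
    for code in entity_codes.values():
        for i in range(len(code)):
            allowed.setdefault(code[:i], set()).add(code[i])
    return allowed


def greedy_decode(
    score_fn: Callable[[Code, str], float],
    allowed: Dict[Code, Set[str]],
    code_len: int,
) -> Code:
    """Greedily pick the best-scoring valid token at each decoding step."""
    prefix: Code = ()
    for _ in range(code_len):
        candidates = allowed.get(prefix)
        if not candidates:  # no known code continues this prefix
            break
        # sorted() only makes tie-breaking deterministic
        best = max(sorted(candidates), key=lambda tok: score_fn(prefix, tok))
        prefix = prefix + (best,)
    return prefix


# Toy label space (hypothetical codes, all of length 2).
entity_codes = {
    "Eiffel Tower": ("eiffel", "tower"),
    "Tokyo Tower": ("tokyo", "tower"),
    "Tokyo Station": ("tokyo", "station"),
}
code_to_entity = {code: name for name, code in entity_codes.items()}
allowed = build_prefix_index(entity_codes)

# Hypothetical per-token scores standing in for a model's output given one image.
toy_scores = {"eiffel": 0.2, "tokyo": 0.8, "tower": 0.3, "station": 0.7}


def toy_score_fn(prefix: Code, token: str) -> float:
    # A real model would condition on the image and the prefix; this toy ignores both.
    return toy_scores[token]


decoded = greedy_decode(toy_score_fn, allowed, code_len=2)
predicted_entity = code_to_entity[decoded]
```

Unlike a dual-encoder, which compares the query embedding against every entity embedding, this formulation identifies an entity in a number of decoding steps equal to the code length.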

Paper Structure

This paper contains 35 sections, 1 equation, 14 figures, 6 tables, and 1 algorithm.

Figures (14)

  • Figure 1: We introduce GER, a novel generative paradigm for web-scale visual entity recognition. We create compact semantic codes for each entity, and learn to auto-regressively generate them for a given query image at inference.
  • Figure 2: Overview of the GER-ALD method. (a) We utilize a text tokenizer to create compact and semantic codes, which represent each entity with short but discriminative representations. (b) We learn a generative auto-regressive model, which learns to decode the correct code for a given query image and text pair.
  • Figure 3: Semantic vs atomic codes. We report the relative improvement in $\%$ of GER-ALD compared to GER-ATOMIC in 3 scenarios: (i) limited pretraining data, (ii) limited model capacity, and (iii) massive-scale label space. Plots share a common experiment, shown by $\mdblksquare$, which uses a pretraining dataset size of $27M$, a Large model, and a 6M entity set. The setting reported in Tab. \ref{tab:main_baseline} is .
  • Figure 3: Ablation study of GER-ALD codes. (left) Word token selection. (right) Token order. All variants use $L=4$. Default is in top rows. The non-language-based GER-ATOMIC gets $11.4$ top-1.
  • Figure 4: Accuracy per entity name length for GER-ALD versus GER-CAPTION codes. (left) Accuracy averaged per entity name length. (right) Qualitative examples of predictions for long entity names. Code tokens are shown between brackets.
  • ...and 9 more figures
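
The captions above mention building short, discriminative codes from entity names with a text tokenizer and a maximum length $L$. A minimal sketch of one way this could look follows. It rests on assumptions that the captions do not confirm: a whitespace tokenizer replaces a real subword tokenizer, the rarest tokens across the entity set are treated as the most discriminative, and the selected tokens keep their original order. The function names and the toy entity list are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(name: str) -> List[str]:
    """Stand-in tokenizer (assumption): lowercase, split on whitespace."""
    return name.lower().split()


def make_codes(names: List[str], max_len: int) -> Dict[str, Tuple[str, ...]]:
    """Keep the `max_len` rarest tokens of each name, in their original order."""
    # Document frequency: in how many entity names does each token occur?
    doc_freq = Counter(tok for name in names for tok in set(tokenize(name)))
    codes: Dict[str, Tuple[str, ...]] = {}
    for name in names:
        tokens = list(dict.fromkeys(tokenize(name)))  # de-duplicate, keep order
        # Rank positions by rarity; ties go to the earlier token.
        ranked = sorted(range(len(tokens)), key=lambda i: (doc_freq[tokens[i]], i))
        keep = sorted(ranked[:max_len])  # restore original token order
        codes[name] = tuple(tokens[i] for i in keep)
    return codes


# Toy entity set (hypothetical); the captions report L = 4, here L = 2 for brevity.
names = [
    "List of birds of Peru",
    "List of birds of Chile",
    "Eiffel Tower",
    "Tokyo Tower",
]
codes = make_codes(names, max_len=2)
# A usable code set must assign a distinct code to every entity.
all_unique = len(set(codes.values())) == len(names)
```

In this toy set the long names shrink to two tokens each while staying distinct, which is the property a compact code needs. On a real entity set, collisions would have to be detected and resolved, which this sketch does not attempt.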