A Generative Approach for Wikipedia-Scale Visual Entity Recognition

Mathilde Caron; Ahmet Iscen; Alireza Fathi; Cordelia Schmid

TL;DR

This paper introduces a novel Generative Entity Recognition (GER) framework, which, given an input image, learns to auto-regressively decode a semantic and discriminative “code” identifying the target entity.

Abstract

In this paper, we address web-scale visual entity recognition, specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is to use dual-encoder models (e.g., CLIP), where all the entity names and query images are embedded into a unified space, paving the way for an approximate k-NN search. Alternatively, it is also possible to repurpose a captioning model to directly generate the entity names for a given image. In contrast, we introduce a novel Generative Entity Recognition (GER) framework, which, given an input image, learns to auto-regressively decode a semantic and discriminative “code” identifying the target entity. Our experiments demonstrate the efficacy of this GER paradigm, showcasing state-of-the-art performance on the challenging OVEN benchmark. GER surpasses strong captioning, dual-encoder, visual matching and hierarchical classification baselines, affirming its advantage in tackling the complexities of web-scale recognition.
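The dual-encoder baseline mentioned in the abstract can be illustrated with a minimal sketch: entity names and the query image are assumed to be already embedded in a shared space, and recognition reduces to a nearest-neighbour lookup. Everything here is hypothetical and for illustration only: the three-dimensional embeddings, the entity names, and the helper `cosine_top_k` are not from the paper, and the search is exact rather than the approximate k-NN that a 6-million-entity index would need.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_top_k(query, entity_embeddings, k=1):
    """Return the k entity names whose embeddings are most similar to the query.

    Exact brute-force search over all entities; a web-scale system would
    replace this loop with an approximate nearest-neighbour index.
    """
    q = l2_normalize(query)
    scored = []
    for name, emb in entity_embeddings.items():
        e = l2_normalize(emb)
        similarity = sum(a * b for a, b in zip(q, e))
        scored.append((similarity, name))
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for _, name in scored[:k]]


# Hypothetical toy embeddings standing in for the text-encoder outputs.
ENTITY_EMBEDDINGS = {
    "Golden Gate Bridge": [0.9, 0.1, 0.0],
    "Eiffel Tower": [0.1, 0.9, 0.1],
    "Mount Fuji": [0.0, 0.2, 0.9],
}

# Hypothetical image-encoder output for a query photo.
query_image_embedding = [0.8, 0.2, 0.1]

print(cosine_top_k(query_image_embedding, ENTITY_EMBEDDINGS, k=2))
```

GER differs from this retrieval setup in that it does not compare the query against an index of entity embeddings; it generates the entity's code token by token.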

Paper Structure (35 sections, 1 equation, 14 figures, 6 tables, 1 algorithm)

Introduction
Related work
Method
Problem definition
GER-ALD: Creating ALD codes for GER
Training
Baselines
Experiments
Experimental setting
Comparison with the state of the art
Comparison with baselines
Analysis and ablation study
Semantic versus atomic codes
ALD versus captioning codes
Creating codes with ALD
...and 20 more sections

Figures (14)

Figure 1: We introduce GER, a novel generative paradigm for web-scale visual entity recognition. We create compact semantic codes for each entity, and learn to auto-regressively generate them for a given query image at inference.
Figure 2: Overview of the GER-ALD method. (a) We use a text tokenizer to create compact, semantic codes, which represent each entity with a short but discriminative representation. (b) We train a generative auto-regressive model to decode the correct code for a given query image and text pair.
Figure 3: Semantic vs atomic codes. We report the relative improvement in $\%$ of GER-ALD compared to GER-atomic in 3 scenarios: (i) limited pretraining data, (ii) limited model capacity and (iii) massive-scale label-space. Plots share a common experiment shown by $\blacksquare$ which uses a pretraining dataset size of $27M$, Large model and 6M entity set. The setting reported in Tab. \ref{tab:main_baseline} is .
Figure 3: Ablation study of GER-ALD codes. (left) Word token selection. (right) Token order. All variants use $L=4$. Default is in top rows. Non-language-based GER-atomic gets $11.4$ top-1.
Figure 4: Accuracy per entity name length for GER-ALD versus GER-caption codes. (left): Accuracy averaged per entity name length. (right): Qualitative examples of predictions for long entity names. Code tokens are symbolized between brackets.
...and 9 more figures
