Grounding Language Models for Visual Entity Recognition

Zilin Xiao; Ming Gong; Paola Cascante-Bonilla; Xingyao Zhang; Jie Wu; Vicente Ordonez

TL;DR

This work targets open-domain Visual Entity Recognition, where answering a question about an image requires grounding the answer in a massive, Wikipedia-scale entity space. AutoVER introduces a retrieval-augmented constrained decoding framework: a retrieval token produces a query representation, the top retrieved entities are organized into a dynamically built prefix tree, and generation is restricted to the identifiers of those grounded entities. Training jointly optimizes a query-to-entity contrastive objective and a language modeling objective, supplemented by two hard-negative mining strategies that address fine-grained visual distinctions. On Oven-Wiki, AutoVER substantially outperforms baselines on the seen, unseen, and query splits, and it transfers zero-shot to A-OKVQA-Ent. Ablations indicate that retrieval, decoding constraints, and hard-negative mining each play a critical role in reducing hallucinations and improving grounding.

Abstract

We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model by employing retrieval-augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling in queries that require visually-situated reasoning. Our method learns to distinguish similar entities within a vast label space by contrastively training on hard negative pairs in parallel with a sequence-to-sequence objective, without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits in the recently proposed Oven-Wiki benchmark. Accuracy on the Entity seen split rises from 32.7% to 61.5%. It also demonstrates superior performance on the unseen and query splits by a substantial double-digit margin.
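The idea of "removing invalid decoding paths" can be illustrated with a small prefix tree over candidate entity names. The following is a minimal sketch, not the authors' implementation: the helper names (`build_trie`, `allowed_next`, `constrained_greedy`), the word-level string tokens, and the toy scoring function are all assumptions made for illustration. A real model would operate on subword token ids and use the language model's logits as scores.

```python
from typing import Callable, Dict, List, Sequence

EOS = "<eos>"  # hypothetical end-of-sequence marker


def build_trie(candidates: Sequence[Sequence[str]], eos: str = EOS) -> Dict:
    """Build a nested-dict prefix tree over candidate token sequences."""
    root: Dict = {}
    for seq in candidates:
        node = root
        for tok in list(seq) + [eos]:
            node = node.setdefault(tok, {})
    return root


def allowed_next(trie: Dict, prefix: Sequence[str]) -> List[str]:
    """Return the tokens that keep `prefix` on some candidate path."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []  # prefix has left the tree: nothing is allowed
        node = node[tok]
    return sorted(node.keys())


def constrained_greedy(
    trie: Dict,
    score_fn: Callable[[List[str]], Dict[str, float]],
    eos: str = EOS,
    max_len: int = 32,
) -> List[str]:
    """Greedy decoding that only considers tokens permitted by the trie."""
    out: List[str] = []
    while len(out) < max_len:
        options = allowed_next(trie, out)
        if not options:
            break
        scores = score_fn(out)
        tok = max(options, key=lambda t: scores.get(t, float("-inf")))
        if tok == eos:
            break
        out.append(tok)
    return out


# Toy stand-in for a language model: it prefers "Boeing", which is NOT
# among the retrieved candidates, so unconstrained decoding would drift.
def toy_scores(prefix: List[str]) -> Dict[str, float]:
    return {"Boeing": 5.0, "ATR": 3.0, "42": 2.0, "72": 1.5,
            "Airbus": 1.0, "A320": 0.5, EOS: 0.0}


retrieved = [["ATR", "42"], ["ATR", "72"], ["Airbus", "A320"]]
trie = build_trie(retrieved)
prediction = constrained_greedy(trie, toy_scores)
```

In this toy run the highest-scoring token is outside the tree, so it is never considered, and decoding follows the best-scoring path among the retrieved candidates. Rebuilding the tree per query from the top retrieved candidates keeps the constraint small even when the full entity space is very large.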

Paper Structure (14 sections, 5 equations, 4 figures, 5 tables)

Introduction
Related Work
Methodology
Problem Definition
Model Overview
Hard-Negative Mining
Retrieval-augmented Constrained Decoding
Experiments
Settings
Main Results
Zero-shot Generalization Results
Ablation Study and Discussion
Case Study
Conclusion

Figures (4)

Figure 1: A representative query-entity pair from Oven-Wiki. We briefly illustrate the model inference process and compare predictions from PaLI-17B (in red) and AutoVER-7B (in green), which obtains the correct answer: ATR 42. AutoVER retrieves entity candidates without an external retriever (step 1), dynamically constructs a prefix tree (trie) (step 2), and performs decoding-time augmentation to guide autoregressive generation (step 3).
Figure 2: Joint training of in-batch contrastive learning and language modeling in AutoVER. For each training quadruple consisting of an entity image, an entity description, a query image and a query question, a lightweight Transformer encoder produces the fused entity representation $E_i$ (left half). A special retrieval token prompts the multimodal language model to generate the query representation $Q_i$. The query-to-entity contrastive objective ($\mathrm{L}_{\text{query2ent}}$) encourages correct retrieval of the entity given the query pair, and the language modeling objective ($\mathrm{L}_{\text{LM}}$) supports successful entity grounding.
Figure 3: Illustration of retrieval-augmented constrained decoding in the AutoVER inference process. The query representation $Q$ is used to search a pre-cached entity database built with the multimodal entity encoder, returning the top-$k$ candidate entities. A prefix tree is then built dynamically from the retrieved entity identifiers and constrains the language model's autoregressive next-token generation, which ensures that the generated content is grounded.
Figure 4: Illustration of selected query image-question pairs, retrieved candidates and AutoVER-7B decisions. AutoVER adeptly captures slight variations in the query text and retrieves entirely different entity candidates, which forms the basis for the generative decisions from the language model.
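
The in-batch query-to-entity contrastive objective described in the Figure 2 caption can be sketched with a standard InfoNCE-style loss. This is only one common form and is an assumption here: the caption does not give the exact formula, so the cosine similarity, the temperature value, and the function name `query2ent_loss` are illustrative choices rather than the paper's definition. Each query representation $Q_i$ treats its own entity representation $E_i$ as the positive and the other entities in the batch as negatives.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def query2ent_loss(
    queries: Sequence[Sequence[float]],
    entities: Sequence[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Mean InfoNCE loss where entities[i] is the positive for queries[i]."""
    q = [_normalize(v) for v in queries]
    e = [_normalize(v) for v in entities]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every entity in the batch.
        logits = [sum(a * b for a, b in zip(qi, ej)) / temperature for ej in e]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]  # -log softmax at the positive index
    return total / len(q)


# Toy batch: aligned pairs should give a much lower loss than mismatched ones.
Q = [[1.0, 0.0], [0.0, 1.0]]
aligned = query2ent_loss(Q, [[1.0, 0.0], [0.0, 1.0]])
mismatched = query2ent_loss(Q, [[0.0, 1.0], [1.0, 0.0]])
```

Under this formulation, hard negatives matter because entities that are visually or textually close to the positive produce large logits in the denominator, so they contribute most of the training signal.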
