Grounding Language Models for Visual Entity Recognition

Zilin Xiao, Ming Gong, Paola Cascante-Bonilla, Xingyao Zhang, Jie Wu, Vicente Ordonez

TL;DR

This work targets open-domain Visual Entity Recognition, where answering questions about images requires grounding to a massive Wikipedia-like entity space. AutoVER introduces a retrieval-augmented, constrained decoding framework that uses a retrieval token and a dynamic prefix-tree to limit generation to grounded entity identifiers, guided by both query and entity representations. Training jointly optimizes a query-to-entity contrastive objective and language modeling, supplemented by two hard-negative mining strategies to address fine-grained visual distinctions. Results on Oven-Wiki show substantial improvements over baselines across seen, unseen, and query splits, with strong zero-shot transfer to A-OKVQA-Ent, and ablations validate the critical role of retrieval, decoding constraints, and hard-negative mining in reducing hallucinations and improving grounding.

Abstract

We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model by employing retrieval augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling in queries that require visually-situated reasoning. Our method learns to distinguish similar entities within a vast label space by contrastively training on hard negative pairs in parallel with a sequence-to-sequence objective without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits in the recently proposed Oven-Wiki benchmark. Accuracy on the Entity seen split rises from 32.7% to 61.5%. It also demonstrates superior performance on the unseen and query splits by a substantial double-digit margin.
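
The joint objective described above (contrastive training on hard negative pairs alongside a sequence-to-sequence loss) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a standard in-batch InfoNCE form for the query-to-entity term, and the names `query2ent_loss` and `joint_loss`, the `temperature` value, and the `weight` parameter are all hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def query2ent_loss(queries, entities, temperature=0.07):
    """In-batch contrastive loss (assumed InfoNCE form).

    queries[i] is the positive match for entities[i]. Every other entity
    in the list is a negative; entities appended beyond len(queries) act
    as additional hard negatives.
    """
    q = [normalize(v) for v in queries]
    e = [normalize(v) for v in entities]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, ej) / temperature for ej in e]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def joint_loss(contrastive, lm, weight=1.0):
    """Hypothetical weighted sum of the two objectives."""
    return lm + weight * contrastive


# Toy batch: two queries, two matching entities, one extra hard negative.
queries = [[1.0, 0.0], [0.0, 1.0]]
entities = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.5]]
print(joint_loss(query2ent_loss(queries, entities), lm=2.0))
```

Swapping the first two entities so that each query is paired with the wrong one raises the contrastive term, which is the signal that pushes query and entity representations toward correct retrieval.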

Paper Structure (14 sections, 5 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: A representative query-entity pair from Oven-Wiki. We briefly illustrate the model inference process and compare predictions from PaLI-17B (in red) and AutoVER-7B (in green), which obtains the correct answer: ATR 42. AutoVER retrieves entity candidates without an external retriever (step 1), dynamically constructs a prefix tree (trie) (step 2), and performs decoding-time augmentation to guide autoregressive generation (step 3).
  • Figure 2: Joint training of in-batch contrastive learning and language modeling in AutoVER. For each training quadruple consisting of an entity image, an entity description, a query image, and a query question, a lightweight Transformer encoder produces the fused entity representation $E_i$ (left half). A special retrieval token prompts the multimodal language model to generate the query representation $Q_i$. The query-to-entity contrastive objective ($\mathrm{L}_{\text{query2ent}}$) encourages correct retrieval of entities given the query pair, and the language modeling objective ($\mathrm{L}_{\text{LM}}$) supports successful entity grounding.
  • Figure 3: Illustration of retrieval-augmented constrained decoding in the AutoVER inference process. The representation $Q$ queries a pre-cached entity database constructed with the multimodal entity encoder and retrieves the top-$k$ candidate entities. A prefix tree is dynamically built from the retrieved entity identifiers and constrains the language model as it autoregressively generates each next token, which ensures that the generated content is grounded in a retrieved entity.
  • Figure 4: Illustration of selected query image-question pairs, retrieved candidates and AutoVER-7B decisions. AutoVER adeptly captures slight variations in the query text and retrieves entirely different entity candidates, which forms the basis for the generative decisions from the language model.
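
The inference steps in Figures 1 and 3 (retrieve top-$k$ candidates, build a prefix tree over their identifiers, decode only along valid paths) can be sketched as follows. This is a toy sketch under stated assumptions, not the paper's code: identifiers are pre-split into word-level tokens, similarity is a plain dot product, decoding is greedy, and `retrieve_top_k`, `build_trie`, `constrained_greedy_decode`, `score_fn`, and the `END` marker are hypothetical names.

```python
END = "<eos>"  # hypothetical end-of-identifier marker


def retrieve_top_k(query_vec, entity_db, k):
    """Return the k entity identifiers whose vectors score highest against the query."""
    def score(item):
        _, vec = item
        return sum(a * b for a, b in zip(query_vec, vec))
    ranked = sorted(entity_db.items(), key=score, reverse=True)
    return [identifier for identifier, _ in ranked[:k]]


def build_trie(candidates):
    """Build a nested-dict prefix tree over tokenized entity identifiers."""
    root = {}
    for tokens in candidates:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[END] = {}  # mark a complete identifier
    return root


def constrained_greedy_decode(trie, score_fn):
    """Greedy decoding restricted to tokens that extend a path in the trie.

    score_fn(prefix, token) stands in for the language model's next-token
    score; tokens outside the trie are never considered.
    """
    if not trie:
        raise ValueError("no candidate identifiers to decode over")
    node, output = trie, []
    while True:
        best = max(node, key=lambda tok: score_fn(output, tok))
        if best == END:
            return output
        output.append(best)
        node = node[best]


# Toy entity database: identifier tokens -> made-up entity vector.
entity_db = {
    ("ATR", "42"): [0.9, 0.1],
    ("ATR", "72"): [0.8, 0.2],
    ("Airbus", "A320"): [0.5, 0.5],
    ("Golden", "Retriever"): [0.0, 1.0],
}
candidates = retrieve_top_k([1.0, 0.0], entity_db, k=3)
trie = build_trie(candidates)

# Made-up model scores; "Boeing" scores highest but is not in the trie.
scores = {"Boeing": 0.9, "ATR": 0.6, "Airbus": 0.3, "42": 0.5, "72": 0.4, "A320": 0.2, END: 0.1}
print(constrained_greedy_decode(trie, lambda prefix, tok: scores.get(tok, 0.0)))
```

In this toy run the unconstrained favorite "Boeing" is never emitted because no retrieved identifier starts with it, so the output is `['ATR', '42']`. A real system would operate on subword token IDs and model logits rather than word strings and a lookup table.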