Grounding Language Models for Visual Entity Recognition
Zilin Xiao, Ming Gong, Paola Cascante-Bonilla, Xingyao Zhang, Jie Wu, Vicente Ordonez
TL;DR
This work targets open-domain Visual Entity Recognition, where answering questions about images requires grounding in a massive Wikipedia-like entity space. AutoVER introduces a retrieval-augmented, constrained decoding framework that uses a retrieval token and a dynamic prefix tree to limit generation to grounded entity identifiers, guided by both query and entity representations. Training jointly optimizes a query-to-entity contrastive objective and a language modeling objective, supplemented by two hard-negative mining strategies that address fine-grained visual distinctions. Results on Oven-Wiki show substantial improvements over baselines across the seen, unseen, and query splits, with strong zero-shot transfer to A-OKVQA-Ent. Ablations validate the critical role of retrieval, decoding constraints, and hard-negative mining in reducing hallucinations and improving grounding.
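To make the prefix-tree idea concrete, the following is a minimal sketch of trie-constrained greedy decoding, not the authors' implementation. It assumes the retrieved candidate entity names are already tokenized into lists of string tokens; `build_trie`, `allowed_next`, `constrained_greedy_decode`, the `EOS` marker, and the toy `score_fn` are all hypothetical names introduced for illustration, with `score_fn` standing in for the language model's next-token score.

```python
EOS = "</s>"  # hypothetical end-of-sequence marker


def build_trie(candidates):
    """Build a prefix tree (nested dicts) over the candidate token sequences."""
    root = {}
    for tokens in candidates:
        node = root
        for tok in tokens:
            node = node.setdefault(tok, {})
        node[EOS] = {}  # mark the end of a complete candidate
    return root


def allowed_next(trie, prefix):
    """Return the tokens that may follow `prefix` without leaving the trie."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []  # the prefix is not on any valid decoding path
        node = node[tok]
    return sorted(node.keys())


def constrained_greedy_decode(trie, score_fn, max_len=16):
    """Greedy decoding in which only trie-permitted tokens are considered."""
    prefix = []
    for _ in range(max_len):
        options = allowed_next(trie, prefix)
        if not options:
            break
        best = max(options, key=lambda tok: score_fn(prefix, tok))
        if best == EOS:
            break
        prefix.append(best)
    return prefix
```

Because the trie is rebuilt from the retrieved candidates for each query, a token that scores highly under the model but appears in no candidate can never be emitted, which is how invalid decoding paths are removed in this sketch.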
Abstract
We introduce AutoVER, an Autoregressive model for Visual Entity Recognition. Our model extends an autoregressive Multi-modal Large Language Model by employing retrieval-augmented constrained generation. It mitigates low performance on out-of-domain entities while excelling in queries that require visually situated reasoning. Our method learns to distinguish similar entities within a vast label space by contrastively training on hard negative pairs in parallel with a sequence-to-sequence objective, without an external retriever. During inference, a list of retrieved candidate answers explicitly guides language generation by removing invalid decoding paths. The proposed method achieves significant improvements across different dataset splits in the recently proposed Oven-Wiki benchmark. Accuracy on the Entity seen split rises from 32.7% to 61.5%. It also demonstrates superior performance on the unseen and query splits by a substantial double-digit margin.
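The joint training described above can be illustrated with a small sketch. The abstract does not give the exact form of the contrastive loss, so this example assumes an InfoNCE-style objective over one positive entity and a set of hard negatives, combined with a language modeling loss through a hypothetical `weight` parameter. The function names, the temperature value, and the toy embeddings are illustrative assumptions, not values from the paper.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive, negatives, temperature=0.07):
    """InfoNCE-style loss (assumed form): -log softmax score of the positive."""
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


def joint_loss(contrastive_loss, lm_loss, weight=1.0):
    """Combine both objectives; `weight` is a hypothetical balancing factor."""
    return lm_loss + weight * contrastive_loss


# Toy embeddings: the hard negative lies close to the query, the easy one far away.
query = [1.0, 0.0]
positive = [1.0, 0.0]
easy_negative = [-1.0, 0.0]
hard_negative = [0.9, 0.4]

loss_easy = info_nce(query, positive, [easy_negative])
loss_hard = info_nce(query, positive, [hard_negative])
```

In this toy setting the hard negative yields the larger loss, which reflects why mining visually similar entities gives a stronger training signal than sampling random ones.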
