Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking
Zhengfei Xu, Sijia Zhao, Yanchao Hao, Xiaolong Liu, Lili Li, Yuyang Yin, Bo Li, Xi Chen, Xin Xin
TL;DR
The paper defines Pixel-Level Visual Entity Linking (PL-VEL), a task that links pixel masks to knowledge-base entities to support fine-grained visual understanding. It introduces MaskOVEN-Wiki, a dataset of over 5 million annotations built with an automatic reverse region-to-entity annotation framework that reaches a 94.8% annotation success rate in manual evaluation; models trained on it improve accuracy by 18 points over zero-shot models. A visual semantic tokenization method supplies region-level semantic cues to autoregressive ALD-code decoding in a vision-language model, enabling region-interacted attention and adding about 5 points of accuracy over the trained baseline. Overall, PL-VEL and MaskOVEN-Wiki advance pixel-level grounding, with practical relevance for VQA, visual reasoning, and detailed image captioning, although final linking accuracy remains around 25%, leaving room for further improvement.
Abstract
Visual Entity Linking (VEL) is a crucial task for fine-grained visual understanding: it matches objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs such as clicks or bounding boxes offer a more convenient alternative. We therefore propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks derived from visual inputs to refer to objects, complementing existing reference methods for VEL. To facilitate research on this task, we construct the MaskOVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. The dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, which will advance visual understanding toward finer granularity. Moreover, because pixel masks correspond to semantic regions in an image, we enhance the patch-interacted attention of previous work with region-interacted attention through a visual semantic tokenization approach. Manual evaluation indicates that the reverse annotation framework achieves a 94.8% annotation success rate. Experiments show that models trained on this dataset improve accuracy by 18 points over zero-shot models, and the semantic tokenization method yields a further 5-point accuracy improvement over the trained baseline.
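The abstract does not say how region-level tokens are computed from a pixel mask. As a rough illustration only, and not the authors' method, one common way to turn a mask into a single region token is to average the patch features that the mask covers. The sketch below assumes a small grid of patch feature vectors and a binary mask at patch resolution; the name `mask_pool` and the toy data are hypothetical.

```python
from typing import List


def mask_pool(patch_feats: List[List[List[float]]],
              mask: List[List[int]]) -> List[float]:
    """Average the feature vectors of all patches covered by a binary mask.

    patch_feats[r][c] is the feature vector of the patch at row r, column c.
    mask[r][c] is 1 if that patch belongs to the region, else 0.
    This is an assumed pooling scheme for illustration, not the paper's.
    """
    dim = len(patch_feats[0][0])
    total = [0.0] * dim
    count = 0
    for r, row in enumerate(mask):
        for c, inside in enumerate(row):
            if inside:
                for k in range(dim):
                    total[k] += patch_feats[r][c][k]
                count += 1
    if count == 0:
        raise ValueError("mask covers no patches")
    return [t / count for t in total]


# Toy 2x2 patch grid with 2-dimensional features (hypothetical values).
toy_feats = [[[1.0, 0.0], [3.0, 2.0]],
             [[5.0, 4.0], [7.0, 6.0]]]
top_row_mask = [[1, 1],
                [0, 0]]
region_token = mask_pool(toy_feats, top_row_mask)  # mean of the two top-row patches
```

Under this assumption, each mask yields one token, so attention over such tokens would operate on semantic regions rather than on a fixed patch grid.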
