Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

Zhengfei Xu, Sijia Zhao, Yanchao Hao, Xiaolong Liu, Lili Li, Yuyang Yin, Bo Li, Xi Chen, Xin Xin

TL;DR

The paper defines Pixel-Level Visual Entity Linking (PL-VEL), a task that links pixel masks to knowledge-base entities to support fine-grained visual understanding. It introduces MaskOVEN-Wiki, a dataset of roughly 5 million annotations built with a reverse annotation framework; manual evaluation puts the annotation success rate at 94.8%, and models trained on the dataset gain 18 accuracy points over zero-shot models. A visual semantic tokenization method aligns high-level semantic region cues with autoregressive ALD-code decoding in a vision-language model, enabling region-interacted attention and adding about 5 points over the trained baseline. PL-VEL and MaskOVEN-Wiki advance pixel-level grounding, with practical relevance for VQA, visual reasoning, and detailed image captioning, but final linking accuracy remains around 25%, leaving room for further improvement.

Abstract

Visual Entity Linking (VEL) is a crucial task for achieving fine-grained visual understanding: it matches objects within images (visual mentions) to entities in a knowledge base. Previous VEL tasks rely on textual inputs, but writing queries for complex scenes can be challenging. Visual inputs such as clicks or bounding boxes offer a more convenient alternative. We therefore propose a new task, Pixel-Level Visual Entity Linking (PL-VEL), which uses pixel masks derived from visual inputs to refer to objects, supplementing the existing reference methods for VEL. To facilitate research on this task, we constructed the MaskOVEN-Wiki dataset through an entirely automatic reverse region-entity annotation framework. The dataset contains over 5 million annotations aligning pixel-level regions with entity-level labels, which will advance visual understanding toward finer granularity. Moreover, because pixel masks correspond to semantic regions in an image, we extend previous patch-interacted attention to region-interacted attention through a visual semantic tokenization approach. Manual evaluation indicates that the reverse annotation framework achieves a 94.8% annotation success rate. Experiments show that models trained on this dataset improve accuracy by 18 points over zero-shot models, and that the semantic tokenization method adds a further 5-point accuracy improvement over the trained baseline.
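
The abstract does not spell out how visual semantic tokenization is implemented. As a purely illustrative sketch, and an assumption rather than the authors' method, one common way to turn a pixel mask into a single region-level token is mask-average pooling: average the feature vectors of the image patches that the mask covers. The function name `region_token` and the toy features below are hypothetical.

```python
from typing import List, Sequence


def region_token(patch_features: Sequence[Sequence[float]],
                 patch_mask: Sequence[int]) -> List[float]:
    """Average the feature vectors of the patches selected by a binary mask.

    Hypothetical helper: illustrates mask-average pooling only, not the
    tokenization used in the paper.
    """
    if len(patch_features) != len(patch_mask):
        raise ValueError("features and mask must have the same length")
    # Keep only the patches the mask marks as part of the region.
    selected = [f for f, m in zip(patch_features, patch_mask) if m]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    # Component-wise mean over the selected patches.
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


# Toy example: four patches with 2-D features; the mask covers patches 1 and 2.
patches = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
mask = [0, 1, 1, 0]
print(region_token(patches, mask))  # [4.0, 3.0]
```

A token built this way summarizes one semantic region instead of one fixed-size patch, which is the general intuition behind attending over regions rather than patches.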

Paper Structure

This paper contains 38 sections, 4 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Comparison of text-based and pixel-based Visual Entity Linking (VEL) tasks.
  • Figure 2: Overview of the annotation framework. (a) Comparison of direct and reverse annotation shows that direct annotation struggles to utilize existing entity labels effectively, whereas reverse annotation efficiently reduces the search space. (b) Knowledge-enhanced text prompt for segmentation models, built on intensional and extensional expansion.
  • Figure 3: The procedure of building MaskOVEN-Wiki. The illustration image is AI-generated [chang2024fluxfastsoftwarebasedcommunication].
  • Figure 4: Entity category distribution in the evaluation set.
  • Figure 5: Distribution of MaskOVEN-Wiki: (a) distribution of entity categories; (b) comparison of the entity category distribution between MaskOVEN-Wiki and OVEN-Wiki; (c) distribution of mask ratios for visual mentions in images.
  • ...and 5 more figures