Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach

Mathilde Caron, Alireza Fathi, Cordelia Schmid, Ahmet Iscen

TL;DR

This paper proposes a methodology to curate a web-scale visual entity recognition dataset, using a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. The same multimodal LLM also enriches the dataset by generating question-answer pairs and grounded, fine-grained textual descriptions.

Abstract

Web-scale visual entity recognition, the task of associating images with their corresponding entities within vast knowledge bases like Wikipedia, presents significant challenges due to the lack of clean, large-scale training data. In this paper, we propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation. Instead of relying on the multimodal LLM to directly annotate data, which we found to be suboptimal, we prompt it to reason about potential candidate entity labels by accessing additional contextually relevant information (such as Wikipedia), resulting in more accurate annotations. We further use the multimodal LLM to enrich the dataset by generating question-answer pairs and a grounded, fine-grained textual description (referred to as a "rationale") that explains the connection between images and their assigned entities. Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks (e.g., a +6.9% improvement on the OVEN entity task), underscoring the importance of high-quality training data in this domain.
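
The verify-and-correct idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` structure, the prompt wording, the `KEEP` / `CORRECT:` / `DISCARD` reply format, and the `llm` callable are all hypothetical names chosen for illustration, and the image itself (which a real multimodal model would receive alongside the text) is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Candidate:
    """A candidate entity plus its context (hypothetical structure)."""
    name: str
    wikipedia_summary: str


def build_verification_prompt(caption: str, candidate: Candidate) -> str:
    """Assemble the text part of the prompt (assumed wording, not the paper's)."""
    return (
        "You are given an image, its caption, and a candidate Wikipedia entity.\n"
        f"Caption: {caption}\n"
        f"Candidate entity: {candidate.name}\n"
        f"Wikipedia summary: {candidate.wikipedia_summary}\n"
        "If the candidate matches the image, answer 'KEEP'. "
        "Otherwise answer 'CORRECT: <better entity>' or 'DISCARD'."
    )


def parse_verdict(response: str, candidate: Candidate) -> Optional[str]:
    """Map the model's reply to a final label, or None to drop the sample."""
    reply = response.strip()
    if reply.upper().startswith("KEEP"):
        return candidate.name
    if reply.upper().startswith("CORRECT:"):
        corrected = reply[len("CORRECT:"):].strip()
        return corrected or None
    # Anything else (including 'DISCARD') removes the sample.
    return None


def verify_entity(
    caption: str, candidate: Candidate, llm: Callable[[str], str]
) -> Optional[str]:
    """Run one verification round; `llm` is any text-in, text-out callable."""
    prompt = build_verification_prompt(caption, candidate)
    return parse_verdict(llm(prompt), candidate)
```

The key design point mirrored here is that the model is asked to judge a candidate label given supporting context, rather than to name the entity from scratch.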

Paper Structure

This paper contains 24 sections, 2 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Two failure cases of the visual entity recognition dataset of caron2024generative. Our proposed method overcomes these limitations by prompting a multimodal LLM to correct candidate entities. The LLM has access to relevant context such as the candidate entity Wikipedia page and the input image-caption pair. We also enrich the dataset with rationales and question/answer pairs covering diverse entities.
  • Figure 2: "LLM-Refined Entity-WebLI" (REW) dataset. We propose a method to refine the Entity-WebLI dataset of caron2024generative by prompting a multimodal LLM to verify and correct Wikipedia entities. We also prompt the multimodal LLM to output visually grounded rationales and question/answer pairs about diverse attributes of the image. Complete prompts are in Appendix \ref{ap:prompts}.
  • Figure 3: Qualitative analysis of the importance of the entity verification and correction step.
  • Figure 4: Qualitative examples of entities, rationales and question-answer pairs obtained with the multimodal LLM. Our prompt encourages asking questions about diverse entities in the image.
  • Figure 5: Qualitative examples of suboptimal annotations in the OVEN benchmark. We show the input question, input image, OVEN ground-truth entity, and the top-5 predictions of our model.