Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant

Abhirama Subramanyam Penamakuri, Anand Mishra

TL;DR

The proposed VisTEL module combines a state-of-the-art visual text recognition engine with a large multimodal model. It reasons jointly over the textual and visual context provided by surrounding cues in the image to link the visual text entity to the correct knowledge base entity.

Abstract

We revisit knowledge-aware text-based visual question answering, also known as Text-KVQA, in the light of modern advancements in large multimodal models (LMMs), and make the following contributions: (i) We propose VisTEL, a principled approach to visual text entity linking. The VisTEL module combines a state-of-the-art visual text recognition engine with a large multimodal model, reasoning jointly over the textual and visual context provided by surrounding cues in the image to link the visual text entity to the correct knowledge base entity. (ii) We present KaLMA, a knowledge-aware large multimodal assistant that augments an LMM with knowledge associated with the visual text entity in the image to arrive at an accurate answer. Further, we provide a comprehensive experimental analysis and compare our approach with traditional visual question answering models, models that predate LMMs, and LMMs, as well as prior top-performing approaches. Averaged over the three splits of Text-KVQA, our approach surpasses the previous best approach by 23.3% absolute and establishes a new state of the art. We make our implementation publicly available.

Paper Structure

This paper contains 16 sections, 1 equation, 8 figures, 6 tables.

Figures (8)

  • Figure 1: (a) Text-KVQA (Singh et al., 2019): Given an image containing a named entity as visual text, e.g., "Domino's" in this illustration, the aim is to answer the question by leveraging explicit knowledge about the visual text entity. (b) Large multimodal models (LMMs) are one obvious choice for solving such tasks today. However, they alone are insufficient as they hallucinate on visual objects. (c) We propose a novel approach, KaLMA, that augments an LMM with specialized visual text recognition and with relevant knowledge retrieved via visual text entity linking by the proposed VisTEL. Our approach establishes a new state of the art for this task.
  • Figure 2: Challenges associated with visual text entity linking: (a) The visual text entity may appear as an abbreviation rather than the full entity name, e.g., "RBS" instead of "The Royal Bank of Scotland". (b) Visual text with varying fonts and stylized orientation poses a challenge to the recognizer. (c) Example of homonyms, where the visual text "HP" may refer to "Hewlett Packard" (left) or "Hindustan Petroleum" (right).
  • Figure 3: Illustration of VisTEL. We extract visual text from the given image using a visual text recognition engine and, based on textual similarity, obtain $k$ candidate entities from the knowledge base. We fit the OCRed text and the candidate entities into an instruction prompt template, then encode the image using a visual encoder and the text prompt using the LMM embedding module to obtain $X_I$ and $X_T$, respectively. Given these encodings, the LMM generates the entity associated with the visual text in the image. Please refer to the VisTEL section for details.
  • Figure 4: Overview of our proposed framework KaLMA. We first link the visual text in the image $I$ to the entity $E_I$ using VisTEL (see the VisTEL section) and fetch its associated knowledge $K_I$. Then, we frame an instruction prompt with the question $Q$ and the knowledge $K_I$, and encode it using the $\text{LMM}_{\text{embedding}}$ module $f$ to obtain textual features $X_{T_{Q:K_I}}$. We encode the image $I$ using a vision encoder to obtain visual features $X_I$. Then, we concatenate $X_I$ and $X_{T_{Q:K_I}}$ and feed them to the LMM to generate an accurate answer $A$ to the question $Q$. Instruction prompt templates used in our ablation study are shown in the bottom right box, where $T_I^{ocr}$ is the visual text of the image $I$.
  • Figure 5: A selection of our results compared with implicit knowledge-based LMM approaches. Please refer to Qualitative Results in the results section for observations. More results are given in the appendix.
  • ...and 3 more figures
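
The candidate retrieval and prompt construction described in the captions for Figures 3 and 4 can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the prompt wording, the toy knowledge base, and the use of `difflib.SequenceMatcher` as the textual similarity measure are all assumptions made for this example. The paper's own templates and similarity function may differ, and the LMM call itself is omitted.

```python
from difflib import SequenceMatcher


def top_k_candidates(ocr_text, kb_entities, k=3):
    """Return the k knowledge-base entity names most similar to the OCRed text.

    Similarity here is a character-level ratio from difflib; this is a
    stand-in (assumption) for whatever textual similarity the paper uses.
    """
    def score(name):
        return SequenceMatcher(None, ocr_text.lower(), name.lower()).ratio()

    return sorted(kb_entities, key=score, reverse=True)[:k]


def build_linking_prompt(ocr_text, candidates):
    """Fill a hypothetical VisTEL-style instruction template."""
    return (
        f"Visual text in the image: {ocr_text}\n"
        f"Candidate entities: {'; '.join(candidates)}\n"
        "Which candidate entity does the visual text refer to?"
    )


def build_answer_prompt(question, knowledge):
    """Fill a hypothetical KaLMA-style template with question Q and knowledge K_I."""
    return f"Knowledge: {knowledge}\nQuestion: {question}\nAnswer:"


# Toy knowledge base (illustrative entries, not taken from the Text-KVQA data).
kb = {
    "Domino's Pizza": "Domino's Pizza is a pizza restaurant chain.",
    "Hewlett Packard": "Hewlett Packard is an information technology company.",
    "Hindustan Petroleum": "Hindustan Petroleum is an oil and gas company.",
    "The Royal Bank of Scotland": "The Royal Bank of Scotland is a bank.",
}

candidates = top_k_candidates("Dominos", list(kb), k=2)
linking_prompt = build_linking_prompt("Dominos", candidates)

# In the full pipeline an LMM would pick the entity from the image and the
# linking prompt; here we simply take the top-ranked candidate instead.
linked_entity = candidates[0]
answer_prompt = build_answer_prompt("What does this store sell?", kb[linked_entity])
print(linking_prompt)
print(answer_prompt)
```

String similarity alone cannot resolve the homonym case in Figure 2(c), where "HP" matches several entities comparably well. That is why the candidates are passed to the LMM together with the image rather than accepting the top-ranked string match directly.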