Scene Text Visual Question Answering

Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, Dimosthenis Karatzas

TL;DR

This work introduces ST-VQA, a dataset and framework that require reading scene text to answer questions, addressing a gap in traditional VQA where text is neglected. It provides images from diverse sources, three increasingly challenging tasks with per-image and global dictionaries, and a new ANLS metric to jointly evaluate reasoning and OCR accuracy. Baseline experiments show that incorporating textual information improves VQA performance and that purely visual models underperform compared to text-aware approaches, highlighting the need for end-to-end reading systems. The paper also outlines future directions toward generative, multi-word answers and better integration of language priors in scene-text VQA, with an online evaluation service for public benchmarking.

Abstract

Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image. In this work, we present a new dataset, ST-VQA, that aims to highlight the importance of exploiting high-level semantic information present in images as textual cues in the VQA process. We use this dataset to define a series of tasks of increasing difficulty for which reading the scene text in the context provided by the visual information is necessary to reason and generate an appropriate answer. We propose a new evaluation metric for these tasks to account both for reasoning errors as well as shortcomings of the text recognition module. In addition we put forward a series of baseline methods, which provide further insight to the newly released dataset, and set the scene for further research.
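
To make the evaluation metric concrete, the following is a minimal sketch of ANLS (Average Normalized Levenshtein Similarity) as it is commonly defined for this benchmark: for each question, the prediction is compared against every ground-truth answer using the edit distance normalized by the longer string, similarities whose normalized distance reaches a threshold are zeroed, and the best per-question score is averaged over all questions. The function names, the threshold default `tau=0.5`, and the lowercase/strip normalization are assumptions made for illustration, not the official evaluation code.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(ground_truths: Sequence[Sequence[str]],
         predictions: Sequence[str],
         tau: float = 0.5) -> float:
    """Average, over questions, of the best thresholded similarity.

    ground_truths[i] holds the accepted answers for question i.
    tau is the cut-off (assumed 0.5 here): a normalized distance at or
    above it scores 0, so an answer that picks the wrong text gets no
    credit, while small recognition errors are only lightly penalized.
    """
    if len(ground_truths) != len(predictions):
        raise ValueError("one prediction is required per question")
    if not predictions:
        return 0.0
    total = 0.0
    for answers, prediction in zip(ground_truths, predictions):
        pred = prediction.strip().lower()  # assumed normalization
        best = 0.0
        for answer in answers:
            gt = answer.strip().lower()
            longest = max(len(gt), len(pred))
            nl = levenshtein(gt, pred) / longest if longest else 0.0
            score = 1.0 - nl if nl < tau else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


if __name__ == "__main__":
    gts: List[List[str]] = [["coca cola"], ["stop"]]
    print(anls(gts, ["coca co1a", "yield"]))
```

In this sketch a single misread character ("coca co1a") still earns most of the credit for that question, while an unrelated answer ("yield") earns none, which is how the metric separates recognition shortcomings from reasoning errors.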

Paper Structure

This paper contains 11 sections, 2 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Recognising and interpreting textual content is essential for scene understanding. In the Scene Text Visual Question Answering (ST-VQA) dataset leveraging textual information in the image is the only way to solve the QA task.
  • Figure 2: Percentage of questions (top) and answers (bottom) that contain a specific number of words.
  • Figure 3: Distribution of questions in the ST-VQA train set by their starting 4-grams (ordered from center to outwards). Words with a small contribution are not shown for better visualization.
  • Figure 4: Distribution of answers for different types of questions in the ST-VQA train set. Each color represents a different unique answer.
  • Figure 5: Results of baseline methods in the open vocabulary task of ST-VQA by question type.
  • ...and 1 more figure