Scene-Text Grounding for Text-Based Video Question Answering

Sheng Zhou, Junbin Xiao, Xun Yang, Peipei Song, Dan Guo, Angela Yao, Meng Wang, Tat-Seng Chua

TL;DR

This work addresses the opacity of TextVideoQA models and their reliance on scene-text recognition by introducing Grounded TextVideoQA, which requires both answering questions and spatio-temporally grounding the relevant scene-text regions. It proposes T2S-QA, a cascaded temporal-to-spatial grounding model trained with a contrastive learning objective, and introduces ViTXT-GQA, a dataset with spatio-temporal grounding annotations for evaluation. Experiments show that T2S-QA improves both grounding and QA over strong baselines but still falls well short of human performance, with scene-text recognition (OCR) and current evaluation strategies identified as key bottlenecks. The work provides a dataset, baselines, and analyses to spur progress toward interpretable and reliable TextVideoQA systems.

Abstract

Existing efforts in text-based video question answering (TextVideoQA) are criticized for their opaque decision-making and heavy reliance on scene-text recognition. In this paper, we propose to study Grounded TextVideoQA by forcing models to answer questions and spatio-temporally localize the relevant scene-text regions, thus decoupling QA from scene-text recognition and promoting research towards interpretable QA. The task has three-fold significance. First, it encourages the use of scene-text evidence rather than other short-cuts for answer prediction. Second, it directly accepts scene-text regions as visual answers, thus circumventing the problem of ineffective answer evaluation by stringent string matching. Third, it isolates the challenges inherent in VideoQA from those of scene-text recognition. This enables diagnosis of the root causes of failed predictions, e.g., whether a failure stems from wrong QA or from wrong scene-text recognition. To achieve Grounded TextVideoQA, we propose the T2S-QA model, which highlights a disentangled temporal-to-spatial contrastive learning strategy for weakly-supervised scene-text grounding and grounded TextVideoQA. To facilitate evaluation, we construct a new dataset, ViTXT-GQA, which features 52K scene-text bounding boxes within 2.2K temporal segments related to 2K questions and 729 videos. With ViTXT-GQA, we perform extensive experiments and demonstrate the severe limitations of existing techniques in Grounded TextVideoQA. While T2S-QA achieves superior results, the large gap to human performance leaves ample room for improvement. Our further analysis with oracle scene-text inputs indicates that the major challenge is scene-text recognition. To advance research on Grounded TextVideoQA, our dataset and code are available at https://github.com/zhousheng97/ViTXT-GQA.git
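
To illustrate the second point of the abstract (accepting a scene-text region as a visual answer rather than relying on strict string matching), the following is a minimal sketch and not the paper's evaluation code. The function names, the (x1, y1, x2, y2) box format, and the 0.5 IoU threshold are assumptions made for illustration only.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def exact_match(pred_text, gold_text):
    """Strict string matching (case-insensitive, whitespace-trimmed)."""
    return pred_text.strip().lower() == gold_text.strip().lower()


def grounded_correct(pred_box, gold_boxes, threshold=0.5):
    """Count a prediction as correct if it overlaps any annotated box.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return any(box_iou(pred_box, g) >= threshold for g in gold_boxes)


# A recognition slip makes the textual answer fail string matching ...
text_ok = exact_match("30 MPH", "30 M.P.H.")
# ... while the localized region can still be judged on its own.
region_ok = grounded_correct((10, 10, 50, 30), [(12, 11, 52, 31)])
```

In this toy case the textual answer fails string matching because of a recognition difference, while the predicted region still overlaps the annotated one, which is the kind of failure the grounded setting is meant to separate.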

Paper Structure (23 sections, 12 equations, 7 figures, 11 tables)

This paper contains 23 sections, 12 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Comparison between existing research and our work for TextVideoQA. (a) Existing research has two major problems: 1) Opaque decision-making: it is hard to tell whether the answers (e.g., "30") originate from the relevant scene text in the videos or are attributable to other short-cuts. 2) Heavy reliance on scene-text recognition: low QA accuracy could be due to a failure in decoding the textual answer (e.g., "30 M.P.H.") from the corresponding scene-text region. (b) We establish a novel pipeline that spatio-temporally localizes the scene-text region and then decodes it into a textual answer. We also enable direct evaluation on the grounded scene-text region.
  • Figure 2: Overview of our T2S-QA model. It consists of three main components: (1) the Feature Representation module prepares the features of the question $\mathrm{Q}$, video frames $\mathrm{F}$, and OCR tokens $\mathrm{O}$; (2) the Contrastive Temporal-Spatial Grounding module adopts a two-stage fine-grained grounding approach; (3) the Answer Decoder integrates the grounded frames $\mathrm{F}^{+}$, the grounded OCR tokens $\mathrm{O}^{+}$, and the question $\mathrm{Q}$ to generate the answer. During optimization, we introduce a contrastive learning mechanism to improve the model's question answering and answer grounding capabilities.
  • Figure 3: Examples of spatial-temporal labels in ViTXT-GQA.
  • Figure 4: Analysis of annotated spatio-temporal labels.
  • Figure 5: Prompts for MLLMs to perform ViTXT-GQA.
  • ...and 2 more figures
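
The two-stage grounding described in the Figure 2 caption (select question-relevant frames first, then select OCR tokens within those frames) can be sketched roughly as below. This is an illustrative toy and not the T2S-QA implementation: cosine similarity, hard top-k selection, the function names, and the parameters `k_frames` and `k_tokens` are all assumptions, and the learned features and contrastive training objective are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def temporal_to_spatial(question, frames, ocr_tokens, k_frames=2, k_tokens=1):
    """Hypothetical cascade: temporal grounding, then spatial grounding.

    question   : feature vector Q
    frames     : list of frame feature vectors F
    ocr_tokens : ocr_tokens[i] is the list of OCR token vectors in frame i
    Returns (grounded frame indices, grounded (frame, token) index pairs).
    """
    # Stage 1 (temporal): keep the frames most similar to the question.
    frame_scores = [cosine(question, f) for f in frames]
    kept_frames = sorted(top_k(frame_scores, k_frames))

    # Stage 2 (spatial): rank OCR tokens, but only inside the kept frames.
    candidates = [
        (fi, ti, cosine(question, tok))
        for fi in kept_frames
        for ti, tok in enumerate(ocr_tokens[fi])
    ]
    candidates.sort(key=lambda c: c[2], reverse=True)
    grounded_tokens = [(fi, ti) for fi, ti, _ in candidates[:k_tokens]]
    return kept_frames, grounded_tokens


question = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
ocr_tokens = [
    [[1.0, 0.0]],              # frame 0: a matching token in an irrelevant frame
    [[0.0, 1.0], [1.0, 0.1]],  # frame 1
    [[0.5, 0.5]],              # frame 2
]
kept, grounded = temporal_to_spatial(question, frames, ocr_tokens)
```

In the toy input, the token in frame 0 matches the question perfectly but is never considered, because frame 0 is filtered out in the temporal stage; this is the behaviour a cascaded design implies.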