EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering

Sheng Zhou, Junbin Xiao, Qingyun Li, Yicong Li, Xun Yang, Dan Guo, Meng Wang, Tat-Seng Chua, Angela Yao

TL;DR

EgoTextVQA introduces a realistic egocentric scene-text QA benchmark with timestamped QA across Outdoor and Indoor settings, enabling live, multi-frame reasoning about text in dynamic environments. The study benchmarks 10 modern multimodal LLMs, finding that even the best models (around 33-39% accuracy) lag behind humans, and highlighting temporal grounding, high-resolution inputs, and OCR augmentation as key levers for improvement. Through extensive ablations and heuristic explorations, the authors reveal that combining video context with scene-text information yields the best performance and provide actionable guidance for future model designs and data collection. The dataset, analysis, and prompts offer a solid testbed for advancing real-time egocentric scene-text QA, with practical implications for assistive AI in everyday tasks.

Abstract

We introduce EgoTextVQA, a novel and rigorously constructed benchmark for egocentric QA assistance involving scene text. EgoTextVQA contains 1.5K ego-view videos and 7K scene-text aware questions that reflect real user needs in outdoor driving and indoor house-keeping activities. The questions are designed to elicit identification and reasoning on scene text in an egocentric and dynamic environment. With EgoTextVQA, we comprehensively evaluate 10 prominent multimodal large language models. Currently, all models struggle, and the best results (Gemini 1.5 Pro) are around 33% accuracy, highlighting the severe deficiency of these techniques in egocentric QA assistance. Our further investigations suggest that precise temporal grounding and multi-frame reasoning, along with high resolution and auxiliary scene-text inputs, are key for better performance. With thorough analyses and heuristic suggestions, we hope EgoTextVQA can serve as a solid testbed for research in egocentric scene-text QA assistance. Our dataset is released at: https://github.com/zhousheng97/EgoTextVQA.

Paper Structure

This paper contains 23 sections, 9 figures, 17 tables.

Figures (9)

  • Figure 1: EgoTextVQA targets QA assistance involving scene text from an egocentric perspective, mainly in outdoor driving (EgoTextVQA-Outdoor) and indoor house-keeping (EgoTextVQA-Indoor). The questions reflect real user needs, yet the videos have no visual focus on scene text. Benchmarking results show that all models struggle on EgoTextVQA, highlighting the need for continued improvement.
  • Figure 2: Examples of EgoTextVQA. Scene text plays a pivotal role in understanding and answering the questions, which reflect real user needs. Yet the videos have no visual focus on scene text.
  • Figure 3: Distribution of QAs and OCR numbers.
  • Figure 4: Result Visualization.
  • Figure 5: Performance of MLLMs on the real-time QA subset of EgoTextVQA-Outdoor (~623 QA pairs).
  • ...and 4 more figures