SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA

Haibin He, Qihuang Zhong, Juhua Liu, Bo Du, Peng Wang, Jing Zhang

TL;DR

SFA tackles Video TextVQA by introducing a training-free, three-stage Scan-Focus-Amplify framework that guides Video-LLMs to text-centric regions in videos. It uses a video text spotting (VTS) model to scan text locations, adaptive windows to preserve text integrity, a multimodal relevance scorer to select key regions, and amplification to improve textual clarity before LLM-based reasoning. The approach delivers state-of-the-art results across multiple public benchmarks and demonstrates strong generalization, particularly in challenging text-dense or small-text scenarios. This work bridges fine-grained visual-text perception with holistic video understanding, offering a practical, scalable route to more reliable video-text reasoning.

Abstract

The video text-based visual question answering (Video TextVQA) task aims to answer questions about videos by leveraging the visual text appearing within them. This task poses significant challenges: models must accurately perceive and comprehend scene text that varies in scale, orientation, and clarity across frames, while effectively integrating temporal and semantic context to generate precise answers. Moreover, the model must identify question-relevant textual cues and filter out redundant or irrelevant information so that answering is guided by the most relevant and informative cues. To address these challenges, we propose SFA, a training-free framework and the first Video-LLM-based method tailored for Video TextVQA, motivated by the human process of answering questions. By adaptively scanning video frames, selectively focusing on key regions, and directly amplifying them, SFA effectively guides the Video-LLM's attention toward essential cues, enabling it to generate more accurate answers. SFA achieves new state-of-the-art results across several public Video TextVQA datasets and surpasses previous methods by a substantial margin, demonstrating its effectiveness and generalizability.

Paper Structure

This paper contains 21 sections, 4 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: An example to illustrate the limitations of both existing Video TextVQA methods and Video-LLMs. Video TextVQA methods exhibit strong video text perception (text comprehension) capabilities but often misinterpret video content (content comprehension), leading them to select incorrect textual cues when generating answers. Conversely, Video-LLMs demonstrate robust comprehension of video content but limited sensitivity to video text, causing frequent recognition errors and inaccurate responses, particularly when dealing with extremely small text. In contrast, our method achieves dual-level comprehension of both video text and video content, thereby enabling more accurate answers.
  • Figure 2: The pipeline of SFA. In the Scan stage, candidate regions containing video text are identified using a well-trained VTS model. An adaptive windowing mechanism is designed to prevent fixed-size windows from fragmenting text, thereby avoiding potential semantic incompleteness and inconsistency. During the Focus stage, a scoring model evaluates the importance of each region and retains at most one key region per frame. Finally, in the Amplify stage, the selected key regions are restored to the original frame size and fed into the Video-LLM to produce the final answer.
  • Figure 3: Adaptive Windowing Mechanism. When scanning with fixed-size windows, text lines may be fragmented, potentially altering their semantic meaning (e.g., "MANCHESTER" being recognized as "MAN"). In contrast, adaptive-size windows enable adjustments to their dimensions while preserving the original aspect ratio, ensuring that each text line in the window is fully encompassed and thus maintaining semantic integrity and completeness.
  • Figure 4: The prompts used in SFA. The upper prompt (A) is designed for relevance assessment in the Focus stage, whereas the lower prompt (B) is employed for question answering in the Amplify stage.
  • Figure 5: Case studies. Qwen2.5-VL tends to overlook critical text, focus on irrelevant regions, or misrecognize text, whereas the proposed SFA effectively mitigates these issues.
  • ...and 1 more figure
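
The adaptive windowing idea in Figure 3 and the "at most one key region per frame" rule in Figure 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`adapt_window`, `select_key_region`), the box format `(x0, y0, x1, y1)`, the grow-around-the-center strategy, and the `threshold` parameter are all hypothetical, and the relevance scores are assumed to come from some external scoring model.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical format


def intersects(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def adapt_window(window: Box, text_boxes: List[Box], max_iters: int = 10) -> Box:
    """Grow a window until every text box it touches is fully inside,
    keeping the window's original aspect ratio (assumed strategy)."""
    x0, y0, x1, y1 = window
    aspect = (x1 - x0) / (y1 - y0)
    for _ in range(max_iters):
        cur = (x0, y0, x1, y1)
        touched = [b for b in text_boxes if intersects(cur, b)]
        # Union of the current window and every text box it touches.
        ux0 = min([x0] + [b[0] for b in touched])
        uy0 = min([y0] + [b[1] for b in touched])
        ux1 = max([x1] + [b[2] for b in touched])
        uy1 = max([y1] + [b[3] for b in touched])
        if (ux0, uy0, ux1, uy1) == cur:
            break  # no touched text line is cut by the window border
        # Restore the aspect ratio by growing the shorter side around its center.
        w, h = ux1 - ux0, uy1 - uy0
        if w / h < aspect:
            new_w, cx = h * aspect, (ux0 + ux1) / 2
            ux0, ux1 = cx - new_w / 2, cx + new_w / 2
        else:
            new_h, cy = w / aspect, (uy0 + uy1) / 2
            uy0, uy1 = cy - new_h / 2, cy + new_h / 2
        x0, y0, x1, y1 = ux0, uy0, ux1, uy1
    return (x0, y0, x1, y1)


def select_key_region(scores: List[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the best-scoring region in a frame, or None if no
    region reaches the (assumed) relevance threshold."""
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None


# A 100x100 window that would cut a text line spanning x = 80..140.
window = adapt_window((0, 0, 100, 100), [(80, 40, 140, 60)])
key = select_key_region([0.2, 0.9, 0.4])
```

In this sketch the adapted window may extend past the frame border (negative coordinates); a real pipeline would need to shift or pad it, which is left out here for brevity.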