Text-VQA Aug: Pipelined Harnessing of Large Multimodal Models for Automated Synthesis

Soham Joshi, Shwet Kamal Mishra, Viswanath Gopalakrishnan

TL;DR

The paper tackles the scarcity of task-specific text-VQA pretraining data by introducing Text-VQA Aug, a training-free pipeline that uses large multimodal models to automatically synthesize QA pairs from images with scene text. By integrating GLASS for OCR, Kosmos-2 for local grounding, LLaVA-R for object-crop captioning, and Intel Neural Chat 7B for question generation, followed by post hoc LLM-based validation, the approach yields roughly 72K QA pairs across 44K images. This demonstrates a scalable way to create high-quality text-VQA data without manual annotation, potentially improving downstream model pretraining. The work also outlines practical extensions and applications across accessibility, retail, healthcare, and security domains.

Abstract

Creating large-scale datasets for Visual Question Answering over the text in a scene (text-VQA) requires skilled human annotation, which is tedious and challenging. With the advent of foundation models that handle vision and language modalities, and with the maturity of OCR systems, there is a clear need for an end-to-end pipeline that can synthesize Question-Answer (QA) pairs based on scene text from a given image. We propose a pipeline for the automated synthesis of text-VQA datasets that produces faithful QA pairs and scales with the availability of scene-text data. Our method harnesses multiple models and algorithms for OCR detection and recognition (text spotting), region of interest (ROI) detection, caption generation, and question generation. These components are combined into a cohesive pipeline that automates the synthesis and validation of QA pairs. To the best of our knowledge, this is the first pipeline proposed to automatically synthesize and validate a large-scale text-VQA dataset, comprising around 72K QA pairs based on around 44K images.
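The stages named in the abstract (text spotting, ROI detection, caption generation, question generation, validation) could be chained roughly as in the minimal Python sketch below. It is not the authors' implementation. Each model is passed in as a plain callable, and every name here (`OCRToken`, `QAPair`, `synthesize_qa`, the callable parameters) is hypothetical. The answer rule used, joining the OCR tokens that fall inside a region, is a simplifying assumption standing in for the paper's answer-selection step.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


@dataclass
class OCRToken:
    text: str
    box: Box


@dataclass
class QAPair:
    question: str
    answer: str
    region: Box


def contains(outer: Box, inner: Box) -> bool:
    """Return True when `inner` lies entirely inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def synthesize_qa(
    image: Any,
    spot_text: Callable[[Any], List[OCRToken]],       # OCR stage (hypothetical wrapper)
    detect_regions: Callable[[Any], List[Box]],       # ROI / grounding stage
    caption_region: Callable[[Any, Box], str],        # crop captioning stage
    generate_question: Callable[[str, str], str],     # LLM question generation
    validate: Callable[[str, str, str], bool],        # post hoc LLM validation
) -> List[QAPair]:
    """Run the stages in order and keep only validated QA pairs."""
    tokens = spot_text(image)
    pairs: List[QAPair] = []
    for region in detect_regions(image):
        # Keep only the scene text that falls inside this region.
        local = [t for t in tokens if contains(region, t.box)]
        if not local:
            continue  # no scene text here, so nothing to ask about
        caption = caption_region(image, region)
        # Assumption: the answer is the region's OCR text, in OCR order.
        answer = " ".join(t.text for t in local)
        question = generate_question(caption, answer)
        if question and validate(question, answer, caption):
            pairs.append(QAPair(question, answer, region))
    return pairs
```

Injecting the stages as callables keeps the sketch independent of any particular model API. In practice each callable would wrap one of the models mentioned in the TL;DR, and stubs can stand in for them when testing the control flow.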

Paper Structure

This paper contains 16 sections, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: An illustrative example of the input image and the intermediate outputs in our pipeline. The green boxes in the image represent the object crops on which the captions are generated. The corresponding answers are displayed in parentheses.
  • Figure 2: System Overview. The blue boxes represent the pre-trained large multimodal models, and the yellow boxes represent specific algorithmic modules in our pipeline.
  • Figure 3: Answer Selection using image description (summary) and OCR tokens.
  • Figure 4: Detailed analysis of the synthesized Text-VQA Aug dataset.
  • Figure 5: Frequently occurring phrases from each question type.
  • ...and 2 more figures