TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription

Guo Yutong, Wanying Wang, Yue Wu, Zichen Miao, Haoyu Wang

TL;DR

This work addresses Table VQA on resource-constrained devices by introducing TALENT, a lightweight framework in which a small VLM produces two outputs, precise OCR spans and a natural language narration, that are fed into an LLM for reasoning. By treating the VLM as a perception–narration module and the LLM as the central reasoner, TALENT achieves competitive accuracy with substantially fewer parameters than large end-to-end VLMs, as demonstrated on TableVQA-Bench and the newly introduced ReTabVQA dataset, which requires multi-step quantitative reasoning. Key contributions include the dual representation design, explicit prompt strategies that enforce units and contextual grounding, and a new challenging benchmark for compositional table reasoning. The results show that the LLM’s reasoning capacity dominates as models scale, supporting an efficient and deployable on-device Table VQA solution that robustly handles complex layouts and units. Overall, TALENT bridges symbolic precision and semantic reasoning to enable accurate, efficient table understanding on mobile and edge devices.

Abstract

Table Visual Question Answering (Table VQA) is typically addressed by large vision-language models (VLMs). While such models can answer directly from images, they often miss fine-grained details unless scaled to very large sizes, which are computationally prohibitive, especially for mobile deployment. A lighter alternative is to have a small VLM perform OCR and then use a large language model (LLM) to reason over structured outputs such as Markdown tables. However, these representations are not naturally optimized for LLMs and still introduce substantial errors. We propose TALENT (Table VQA via Augmented Language-Enhanced Natural-text Transcription), a lightweight framework that leverages dual representations of tables. TALENT prompts a small VLM to produce both OCR text and natural language narration, then combines them with the question for reasoning by an LLM. This reframes Table VQA as an LLM-centric multimodal reasoning task, where the VLM serves as a perception-narration module rather than a monolithic solver. Additionally, we construct ReTabVQA, a more challenging Table VQA dataset requiring multi-step quantitative reasoning over table images. Experiments show that TALENT enables a small VLM-LLM combination to match or surpass a single large VLM at significantly lower computational cost on both public datasets and ReTabVQA.
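
The data flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `VLM` and `LLM` callables, the names `perceive`, `build_reasoning_prompt`, and `answer`, and both prompt strings are hypothetical placeholders (the paper's actual prompts are those shown in its Figures 3 and 4, which are not reproduced here).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces: any small VLM and any LLM can be wrapped to fit these.
VLM = Callable[[bytes, str], str]  # (image bytes, instruction) -> generated text
LLM = Callable[[str], str]         # prompt -> answer text

# Placeholder instructions, NOT the prompts used in the paper.
OCR_PROMPT = "Transcribe every cell of the table exactly as written, including units."
NARRATION_PROMPT = "Describe in plain sentences what the table shows, row by row."


@dataclass
class DualRepresentation:
    """The two views of one table image that the VLM is asked to produce."""
    ocr_text: str
    narration: str


def perceive(image: bytes, vlm: VLM) -> DualRepresentation:
    """Query the small VLM twice: once for OCR text, once for narration."""
    return DualRepresentation(
        ocr_text=vlm(image, OCR_PROMPT),
        narration=vlm(image, NARRATION_PROMPT),
    )


def build_reasoning_prompt(rep: DualRepresentation, question: str) -> str:
    """Combine both representations with the question into one LLM prompt."""
    return (
        "OCR transcription of the table:\n"
        f"{rep.ocr_text}\n\n"
        "Natural language description of the table:\n"
        f"{rep.narration}\n\n"
        f"Question: {question}\n"
        "Answer with the value and its unit."
    )


def answer(image: bytes, question: str, vlm: VLM, llm: LLM) -> str:
    """Perception and narration by the VLM, then reasoning by the LLM."""
    rep = perceive(image, vlm)
    return llm(build_reasoning_prompt(rep, question))
```

Because the LLM only ever sees text, the two models can be swapped or scaled independently, which is the property the abstract relies on when it describes the VLM as a perception-narration module rather than a monolithic solver.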

Paper Structure

This paper contains 26 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: The critical dilemma in Table VQA. Existing methods often result in failure (red) by being either too computationally heavy or losing semantic context. Our goal is to achieve an accurate and efficient solution (green).
  • Figure 2: Overall framework of TALENT. TALENT redefines Table VQA by leveraging the VLM as a perception–narration module that generates both symbolic OCR spans and natural-language descriptions of tables. These dual representations, combined with the user question, are processed by an LLM as the central reasoning engine, enabling efficient and robust multimodal table understanding.
  • Figure 3: The prompt used for OCR.
  • Figure 4: The prompt used for natural language narration.
  • Figure 5: Example Table from the ReTabVQA Dataset.
  • ...and 2 more figures