TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription
Guo Yutong, Wanying Wang, Yue Wu, Zichen Miao, Haoyu Wang
TL;DR
This work addresses Table VQA on resource-constrained devices by introducing TALENT, a lightweight framework in which a small VLM produces two outputs, precise OCR spans and a natural language narration, that are fed into an LLM for reasoning. By treating the VLM as a perception-narration module and the LLM as the central reasoner, TALENT achieves competitive accuracy with substantially fewer parameters than large end-to-end VLMs, as demonstrated on TableVQA-Bench and the newly introduced ReTabVQA dataset, which requires multi-step quantitative reasoning. Key contributions include the dual-representation design, explicit prompt strategies that enforce units and contextual grounding, and a new, challenging benchmark for compositional table reasoning. The results indicate that the LLM’s reasoning capacity becomes the dominant factor as models scale, supporting an efficient, deployable on-device Table VQA solution that handles complex layouts and units robustly. Overall, TALENT bridges symbolic precision and semantic reasoning to enable accurate, efficient table understanding on mobile and edge devices.
Abstract
Table Visual Question Answering (Table VQA) is typically addressed by large vision-language models (VLMs). While such models can answer directly from images, they often miss fine-grained details unless scaled to very large sizes, which are computationally prohibitive, especially for mobile deployment. A lighter alternative is to have a small VLM perform OCR and then use a large language model (LLM) to reason over structured outputs such as Markdown tables. However, these representations are not naturally optimized for LLMs and still introduce substantial errors. We propose TALENT (Table VQA via Augmented Language-Enhanced Natural-text Transcription), a lightweight framework that leverages dual representations of tables. TALENT prompts a small VLM to produce both OCR text and natural language narration, then combines them with the question for reasoning by an LLM. This reframes Table VQA as an LLM-centric multimodal reasoning task, where the VLM serves as a perception-narration module rather than a monolithic solver. Additionally, we construct ReTabVQA, a more challenging Table VQA dataset requiring multi-step quantitative reasoning over table images. Experiments show that TALENT enables a small VLM-LLM combination to match or surpass a single large VLM at significantly lower computational cost on both public datasets and ReTabVQA.
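The data flow described above (a small VLM emits OCR text and a narration, and both are combined with the question for an LLM) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the instruction strings, the `TableViews` container, the prompt layout, the choice of two separate VLM calls, and the `vlm`/`llm` callables are all hypothetical placeholders for whatever models and prompts are actually used.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical instructions; the actual prompt wording is not given in the text.
OCR_INSTRUCTION = "Transcribe all text in the table image exactly, cell by cell."
NARRATION_INSTRUCTION = "Describe the table in plain sentences, keeping headers and units."


@dataclass
class TableViews:
    """The two textual views of one table image."""
    ocr_text: str
    narration: str


def perceive(image: bytes, vlm: Callable[[bytes, str], str]) -> TableViews:
    """Query the small VLM once per view (assumption: two separate calls)."""
    return TableViews(
        ocr_text=vlm(image, OCR_INSTRUCTION).strip(),
        narration=vlm(image, NARRATION_INSTRUCTION).strip(),
    )


def build_reasoning_prompt(views: TableViews, question: str) -> str:
    """Combine both views with the question into one text-only prompt."""
    return (
        "OCR transcription:\n" + views.ocr_text + "\n\n"
        "Narration:\n" + views.narration + "\n\n"
        "Question: " + question.strip() + "\n"
        "Answer with the value and its unit."
    )


def answer(
    image: bytes,
    question: str,
    vlm: Callable[[bytes, str], str],
    llm: Callable[[str], str],
) -> str:
    """Perception and narration by the VLM, then reasoning by the LLM."""
    views = perceive(image, vlm)
    return llm(build_reasoning_prompt(views, question))
```

Passing the models as plain callables keeps the sketch independent of any particular inference library; on a device, `vlm` and `llm` would wrap whichever runtimes are available. The LLM never sees the image, only the two text views and the question.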
