Enhancing Large Vision-Language Models with Layout Modality for Table Question Answering on Japanese Annual Securities Reports
Hayato Aida, Kosuke Takahashi, Takahiro Omi
TL;DR
The paper tackles table question answering on Japanese annual securities reports by introducing TableCellQA, which reframes the task as exact cell-value extraction, and evaluates it with a multimodal Large Vision-Language Model (LVLM) that takes text and layout as inputs alongside images. HTML tables are decomposed into image, text, and layout modalities, with layout encoded by a two-layer MLP; the results show that text and layout cues substantially improve accuracy and robustness on real-world, unstructured documents. Key findings: textual content contributes most to accuracy, layout information is complementary because it preserves table structure, and LVLM-based multimodal reasoning is a practical substitute when explicitly structured tables are unavailable. The work also finds that specialized pretraining on layout information does not always improve performance, which suggests strong generalization from standard LVLM pretraining, and it points to extending the approach to more complex mixed-content documents in real-world business settings.
Abstract
With recent advancements in Large Language Models (LLMs) and growing interest in retrieval-augmented generation (RAG), the ability to understand table structures has become increasingly important. This is especially critical in financial domains such as securities reports, where highly accurate question answering (QA) over tables is required. However, tables exist in various formats (including HTML, images, and plain text), which makes it difficult to preserve and extract structural information. Multimodal LLMs are therefore essential for robust, general-purpose table understanding. Despite their promise, current Large Vision-Language Models (LVLMs), a major class of multimodal LLMs, still struggle to accurately recognize characters and their spatial relationships within documents. In this study, we propose a method to enhance LVLM-based table understanding by incorporating in-table textual content and layout features. Experimental results demonstrate that these auxiliary modalities significantly improve performance, enabling robust interpretation of complex document layouts without relying on explicitly structured input formats.
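As a rough illustration of the layout modality described above, the sketch below passes each table cell's bounding box through a two-layer MLP to obtain one layout embedding per cell. It is a minimal sketch under stated assumptions, not the authors' implementation: the box format (x0, y0, x1, y1), the normalization by page size, the ReLU activation, the hidden and embedding dimensions, the random initialization, and the names `LayoutMLP` and `normalize_box` are all hypothetical choices made for illustration.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Create (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(x, layer):
    """Apply a dense layer: one dot product plus bias per output unit."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    """Element-wise ReLU (assumed activation, not confirmed by the source)."""
    return [max(0.0, v) for v in x]


def normalize_box(box, page_width, page_height):
    """Scale an (x0, y0, x1, y1) pixel box into the [0, 1] range."""
    x0, y0, x1, y1 = box
    return [x0 / page_width, y0 / page_height,
            x1 / page_width, y1 / page_height]


class LayoutMLP:
    """Two-layer MLP mapping a 4-value box to a layout embedding."""

    def __init__(self, hidden_dim=8, embed_dim=16, seed=0):
        rng = random.Random(seed)
        self.fc1 = make_linear(4, hidden_dim, rng)
        self.fc2 = make_linear(hidden_dim, embed_dim, rng)

    def __call__(self, norm_box):
        return linear(relu(linear(norm_box, self.fc1)), self.fc2)


# Hypothetical cells: text content plus a pixel bounding box on an 800x600 render.
cells = [
    {"text": "売上高", "box": (40, 100, 200, 130)},
    {"text": "1,234", "box": (200, 100, 360, 130)},
]

encoder = LayoutMLP()
layout_embeddings = [
    encoder(normalize_box(cell["box"], 800, 600)) for cell in cells
]
```

In a full model, each layout embedding would presumably be projected to the LVLM's hidden size and supplied together with the cell text and the table image; how the three modalities are combined is not specified in this section, so that step is left out here.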
