Doc2SoarGraph: Discrete Reasoning over Visually-Rich Table-Text Documents via Semantic-Oriented Hierarchical Graphs
Fengbin Zhu, Chao Wang, Fuli Feng, Zifeng Ren, Moxin Li, Tat-Seng Chua
TL;DR
Doc2SoarGraph addresses discrete reasoning over visually-rich table-text documents in the TAT-DQA setting by modeling element-level semantics with semantic-oriented hierarchical graphs. It defines four node types (Question, Block, Quantity, Date), constructs four graphs ($G_{QC}$, $G_{DC}$, $G_{TR}$, $G_{SD}$), and learns node representations with per-graph GCNs, followed by evidence-based node selection and multi-type answer generation, including a tree-based Arithmetic decoder. The training objective sums seven losses ($\mathcal{L}=\mathcal{L}_{node}+\mathcal{L}_{tree}+\mathcal{L}_{start}+\mathcal{L}_{end}+\mathcal{L}_{type}+\mathcal{L}_{token}+\mathcal{L}_{scale}$) to jointly supervise evidence selection, reasoning, and answer synthesis. On the TAT-DQA dataset, the framework outperforms MHST and zero-shot LLMs, with notable improvements on Arithmetic questions and evidence extraction, establishing a new state of the art and indicating practical value for real-world financial document QA. The code is open-sourced to support reproducibility and broader adoption.
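As an illustration of the training objective above, the following minimal sketch adds the seven loss terms with equal weight, as the formula states. The function name `joint_loss`, the dictionary layout, and the use of plain floats are assumptions made for this example; they are not taken from the authors' released code, which would operate on tensors.

```python
from typing import Dict

# Term names mirror the subscripts in the objective:
# L = L_node + L_tree + L_start + L_end + L_type + L_token + L_scale
LOSS_TERMS = ("node", "tree", "start", "end", "type", "token", "scale")


def joint_loss(losses: Dict[str, float]) -> float:
    """Return the unweighted sum of the seven loss terms.

    `losses` maps each term name to its scalar value for one batch.
    Hypothetical helper for illustration only.
    """
    missing = [name for name in LOSS_TERMS if name not in losses]
    if missing:
        raise KeyError(f"missing loss terms: {missing}")
    return sum(losses[name] for name in LOSS_TERMS)


if __name__ == "__main__":
    # Made-up values, chosen only to show the call.
    example = {name: 0.5 for name in LOSS_TERMS}
    print(joint_loss(example))
```

Because the terms are summed without weights, every component contributes to the gradient on an equal footing; any reweighting would be a departure from the stated objective.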
Abstract
Discrete reasoning over table-text documents (e.g., financial reports) has gained increasing attention in the past two years. Existing works mostly simplify this challenge by manually selecting document pages and transforming them into structured tables and paragraphs, which hinders their practical application. In this work, we explore a more realistic problem setting in the form of TAT-DQA, i.e., answering a question over a visually-rich table-text document. Specifically, we propose a novel Doc2SoarGraph framework with enhanced discrete reasoning capability, which harnesses the differences and correlations among different elements (e.g., quantities, dates) of the given question and document through Semantic-oriented hierarchical Graph structures. We conduct extensive experiments on the TAT-DQA dataset, and the results show that our proposed framework outperforms the best baseline model by 17.73% and 16.91% in terms of Exact Match (EM) and F1 score, respectively, on the test set, achieving a new state of the art.
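To give a feel for the two reported metrics, here is a simplified sketch of Exact Match and a token-overlap F1 between a predicted and a gold answer string. This is an assumption-laden approximation: the official TAT-DQA evaluation may normalize answers, scales, and numbers differently, and the function names `exact_match` and `token_f1` are hypothetical.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the answers match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == gold.strip().lower())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Net Income", "net income"))
    print(token_f1("net income", "net income 2019"))
```

EM rewards only fully correct answers, while F1 gives partial credit for overlapping tokens, which is why the two are usually reported together.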
