TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution
Tejas Anvekar, Junha Park, Rajat Jha, Devanshu Gupta, Poojah Ganesan, Puneeth Mathur, Vivek Gupta
TL;DR
TraceBack is a scalable multi-agent framework for fine-grained, cell-level attribution in single-table QA. By combining schema-aware pruning, evidence grounding, question decomposition, and precise sub-question attribution, it achieves state-of-the-art performance across row-, column-, and cell-level attribution on diverse tabular datasets. To enable scalable evaluation, the authors introduce CITEBench, a benchmark with phrase-aligned gold annotations, and FairScore, a reference-free metric that correlates with human judgments. Together, these contributions improve the transparency, trustworthiness, and evaluability of table-based QA in high-stakes settings. The work lays a foundation for interpretable, scalable grounding in structured-data QA and points to future extensions to multi-table reasoning and broader domains.
Abstract
Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine-grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high-stakes settings. We address this with TraceBack, a modular multi-agent framework for scalable, cell-level attribution in single-table QA. TraceBack prunes tables to relevant rows and columns, decomposes questions into semantically coherent sub-questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence used in intermediate reasoning steps. To enable systematic evaluation, we release CITEBench, a benchmark with phrase-to-cell annotations drawn from ToTTo, FeTaQA, and AIT-QA. We further propose FairScore, a reference-less metric that compares atomic facts derived from predicted cells and answers to estimate attribution precision and recall without human cell labels. Experiments show that TraceBack substantially outperforms strong baselines across datasets and granularities, while FairScore closely tracks human judgments and preserves relative method rankings, supporting interpretable and scalable evaluation of table-based QA.
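To make the idea of comparing atomic facts concrete, the following is a minimal sketch of how precision and recall could be estimated from two sets of facts. It is not the authors' implementation: the function names `normalize` and `fact_overlap_scores` are hypothetical, the facts are assumed to have been extracted already, and matching is done by normalized exact string equality, whereas a real reference-less metric would more plausibly rely on a model-based judgment of whether two facts agree.

```python
from typing import Iterable, Tuple


def normalize(fact: str) -> str:
    """Lowercase a fact and collapse whitespace so trivial variants match."""
    return " ".join(fact.lower().split())


def fact_overlap_scores(
    cell_facts: Iterable[str], answer_facts: Iterable[str]
) -> Tuple[float, float]:
    """Estimate attribution precision and recall from atomic facts.

    Assumption (not from the paper): two facts match only if their
    normalized strings are identical.
    precision = share of facts from the predicted cells that the answer uses
    recall    = share of answer facts that the predicted cells support
    """
    cells = {normalize(f) for f in cell_facts}
    answer = {normalize(f) for f in answer_facts}
    shared = cells & answer
    precision = len(shared) / len(cells) if cells else 0.0
    recall = len(shared) / len(answer) if answer else 0.0
    return precision, recall


# Illustrative, made-up facts: three derived from the cited cells,
# two derived from the generated answer.
cell_facts = [
    "Team A won in 2019",
    "Team A scored 3 goals",
    "Team B was founded in 1950",
]
answer_facts = ["team a won in 2019", "Team A  scored 3 goals"]

precision, recall = fact_overlap_scores(cell_facts, answer_facts)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one cited cell contributes a fact the answer never uses, which lowers precision, while every answer fact is covered by a cited cell, so recall stays at 1.0. No gold cell labels are needed at any point, which is the property the abstract highlights.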
