TQA-Bench: Evaluating LLMs for Multi-Table Question Answering with Scalable Context and Symbolic Extension

Zipeng Qiu, You Peng, Guangxin He, Binhang Yuan, Chen Wang

TL;DR

TQA-Bench addresses the critical gap in evaluating LLMs on multi-table QA by introducing a scalable benchmark built on real-world relational datasets with varied context lengths from 8K to 64K tokens. It combines a GSM-inspired symbolic extension with a rigorous sampling pipeline (topological-order row sampling) to test higher-order reasoning beyond retrieval. The evaluation covers 22 LLMs spanning open- and closed-source families, with models ranging from 2B to 72B parameters and context windows up to 128K, revealing that Markdown serialization generally outperforms CSV and that instruct models tend to outperform chat-oriented ones, especially as context length grows. The work provides a reproducible data-generation pipeline and insights into how serialization format, context length, and symbolic reasoning impact multi-table QA, offering a robust resource for advancing LLM capabilities in real-world, data-centric tasks.
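To make the serialization comparison concrete, the following is a minimal sketch of how one table might be rendered in the two formats before being placed in a prompt. The column names and rows are hypothetical, and the helper functions `to_markdown` and `to_csv` are illustrative assumptions rather than the benchmark's own code.

```python
import csv
import io


def to_markdown(columns, rows):
    """Serialize a table as a pipe-delimited Markdown table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join([header, divider, *body])


def to_csv(columns, rows):
    """Serialize the same table as CSV using the standard library writer."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue().rstrip("\n")


# Hypothetical rows; not taken from the benchmark's datasets.
columns = ["flight_id", "origin", "dest"]
rows = [(1, "JFK", "LAX"), (2, "SFO", "ORD")]

md_text = to_markdown(columns, rows)
csv_text = to_csv(columns, rows)
print(md_text)
print(csv_text)
```

Markdown spends extra characters on delimiters and a divider row, so the same table occupies more of the context window than its CSV form; the reported result suggests that this explicit structure can still be worth the cost.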

Abstract

The advent of large language models (LLMs) has unlocked great opportunities in complex data management tasks, particularly in question answering (QA) over complicated multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing heterogeneous table structures and the potentially large scale of serialized relational data. Existing benchmarks primarily focus on single-table QA, failing to capture the intricacies of reasoning across multiple relational tables, as required in real-world domains such as finance, healthcare, and e-commerce. To address this gap, we present TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities of LLMs in tackling complex QA tasks over relational data. Our benchmark incorporates diverse relational database instances sourced from real-world public datasets and introduces a flexible sampling mechanism to create tasks with varying multi-table context lengths, ranging from 8K to 64K tokens. To ensure robustness and reliability, we integrate symbolic extensions into the evaluation framework, enabling the assessment of LLM reasoning capabilities beyond simple data retrieval or probabilistic pattern matching. We systematically evaluate a range of LLMs, both open-source and closed-source, spanning model scales from 7 billion to 70 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments. Our benchmark implementation and results are available at https://github.com/Relaxed-System-Lab/TQA-Bench.
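One plausible reading of a sampling mechanism that scales a multi-table context is sketched below: tables are visited in foreign-key topological order, so referenced tables are sampled before the tables that point to them, and dangling references are filtered out. This is an assumption about the general idea, not the authors' pipeline. The function `sample_database`, its parameters, and the toy data are hypothetical, and the row cap `max_rows` stands in for a token budget.

```python
import random
from graphlib import TopologicalSorter  # Python 3.9+


def sample_database(tables, foreign_keys, max_rows, seed=0):
    """Down-sample every table while keeping foreign keys resolvable.

    tables:       {table_name: [row_dict, ...]}
    foreign_keys: {child_table: [(child_col, parent_table, parent_col), ...]}
    max_rows:     per-table row cap (a stand-in for a token budget)
    """
    rng = random.Random(seed)
    # Map each table to the set of tables it references (its predecessors).
    deps = {
        name: {parent for _, parent, _ in foreign_keys.get(name, [])}
        for name in tables
    }
    # static_order() yields referenced (parent) tables before their children.
    order = list(TopologicalSorter(deps).static_order())

    sampled = {}
    for name in order:
        rows = tables[name]
        # Keep only rows whose references survived sampling in the parent.
        for child_col, parent, parent_col in foreign_keys.get(name, []):
            kept = {r[parent_col] for r in sampled[parent]}
            rows = [r for r in rows if r[child_col] in kept]
        if len(rows) > max_rows:
            rows = rng.sample(rows, max_rows)
        sampled[name] = rows
    return sampled


# Hypothetical two-table database: flights reference airports.
airports = [{"code": c} for c in ["JFK", "LAX", "SFO", "ORD"]]
flights = [
    {"id": i, "origin": o}
    for i, o in enumerate(["JFK", "LAX", "SFO", "ORD", "JFK", "SFO"])
]
sampled = sample_database(
    {"flights": flights, "airports": airports},
    {"flights": [("origin", "airports", "code")]},
    max_rows=2,
)
print(sampled)
```

Raising or lowering `max_rows` changes the size of the serialized database while every retained foreign key still resolves, which is the property a multi-table question needs in order to remain answerable from the sampled context.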

Paper Structure

This paper contains 15 sections, 5 figures, 6 tables, 2 algorithms.

Figures (5)

  • Figure 1: Symbolic extension formation in the "airline" database.
  • Figure 2: Evaluation prompt template.
  • Figure 3: The overall accuracy of all models.
  • Figure 4: The accuracy distribution of question subcategories in different context lengths.
  • Figure 5: The accuracy distribution of question instances at the 8K context length on the airline database, compared across four models.