Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study

Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, Dongmei Zhang

TL;DR

This paper asks whether large language models can truly understand structured table data. It introduces the SUC benchmark, which evaluates seven structural-understanding tasks, and proposes self-augmented prompting to exploit the model's internal knowledge. It systematically analyzes how input designs (formats, prompts, and ordering) affect performance across tabular reasoning tasks, and identifies HTML markup combined with format explanations and role prompts as a leading configuration. The authors show that careful input design plus self-augmentation yields consistent improvements on downstream tabular tasks (e.g., TabFact, HybridQA, SQA, Feverous, ToTTo), and they provide actionable guidelines for designing prompts and representations. By releasing an open-source benchmark and demonstrating practical gains, the work offers a scalable, model-agnostic path to better use of LLMs for structured data understanding in real-world applications.

Abstract

Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks. However, their capability to process structured data such as tables remains under-explored. While tables can be serialized as input for LLMs, there is a lack of comprehensive studies on whether LLMs genuinely comprehend this data. In this paper, we try to understand this by designing a benchmark to evaluate the structural understanding capabilities of LLMs through seven distinct tasks, e.g., cell lookup, row retrieval, and size detection. Specifically, we perform a series of evaluations on the most advanced recent LLMs, GPT-3.5 and GPT-4, and observe that performance varies with different input choices, including table input format, content order, role prompting, and partition marks. Drawing on the insights gained through the benchmark evaluations, we propose $\textit{self-augmentation}$ for effective structural prompting, such as critical value / range identification using the internal knowledge of LLMs. When combined with carefully chosen input choices, these structural prompting methods lead to promising improvements in LLM performance on a variety of tabular tasks, e.g., TabFact ($\uparrow2.31\%$), HybridQA ($\uparrow2.13\%$), SQA ($\uparrow2.72\%$), Feverous ($\uparrow0.84\%$), and ToTTo ($\uparrow5.68\%$). We believe that our open-source benchmark and proposed prompting methods can serve as a simple yet generic selection for future research. The code and data of this paper will be temporarily released at https://anonymous.4open.science/r/StructuredLLM-76F3/README.md and will be replaced with an official release at https://github.com/microsoft/TableProvider later.
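
As a rough illustration of the input choices named in the abstract (table input format, role prompting, partition marks), the sketch below serializes a small table as HTML and wraps it in a prompt. The function names, the prompt wording, and the `<<TABLE>>` delimiters are assumptions made for illustration; they are not taken from the paper or its released code.

```python
from html import escape


def table_to_html(header, rows):
    """Serialize a header list and a list of row lists as an HTML table."""
    def cells(tag, values):
        # Escape cell text so characters like "&" or "<" stay valid HTML.
        return "".join(f"<{tag}>{escape(str(v))}</{tag}>" for v in values)

    lines = ["<table>", f"<tr>{cells('th', header)}</tr>"]
    lines += [f"<tr>{cells('td', row)}</tr>" for row in rows]
    lines.append("</table>")
    return "\n".join(lines)


def build_prompt(header, rows, question):
    """Combine a role prompt, a format explanation, the table, and a question."""
    # Hypothetical wording: the paper's actual prompts may differ.
    role = "You are an assistant that analyzes tables."
    format_note = (
        "The table is given in HTML: <tr> marks a row, <th> a header cell, "
        "and <td> a data cell."
    )
    table = table_to_html(header, rows)
    # Hypothetical partition marks: explicit delimiters around the table.
    parts = [role, format_note, "<<TABLE>>", table, "<<END TABLE>>",
             f"Question: {question}", "Answer:"]
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(["Name", "Dept"], [["Ann", "R&D"]], "How many rows are there?"))
```

Keeping serialization separate from prompt assembly makes it easy to swap the table format (for example, to a delimiter-separated form) while holding the role prompt and format explanation fixed, which is the kind of comparison the benchmark describes.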

Paper Structure

This paper contains 27 sections, 3 figures, and 9 tables.

Figures (3)

  • Figure 1: SUC Benchmark Overview
  • Figure 2: Input Designs for SUC Evaluation
  • Figure 3: Illustration of self-augmented prompting. This process consists of two phases: 1) using self-augmented prompts to ask the LLM to generate additional knowledge (intermediate output) about the table; 2) incorporating the self-augmented response into the second prompt to request the final answer for a downstream task. As depicted in the figure, the LLM is able to identify important values in the table, which assists in generating a more accurate answer for the downstream task.
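
The two phases described in the Figure 3 caption can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: `llm` stands for any callable that maps a prompt string to a response string, and the function name and prompt wording are hypothetical.

```python
from typing import Callable


def self_augmented_answer(llm: Callable[[str], str], table_text: str, question: str) -> str:
    """Run a two-phase self-augmented prompt and return the final answer."""
    # Phase 1: ask the model for additional knowledge about the table.
    # The instruction text is illustrative wording, not the paper's prompt.
    augment_prompt = (
        f"{table_text}\n\n"
        "Identify the critical values and ranges in the table above."
    )
    knowledge = llm(augment_prompt)

    # Phase 2: feed the intermediate output back in with the downstream task.
    final_prompt = (
        f"{table_text}\n\n"
        f"Additional knowledge about the table:\n{knowledge}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm(final_prompt)


if __name__ == "__main__":
    # Stand-in model for a dry run; replace with a real LLM client call.
    def echo_llm(prompt: str) -> str:
        return f"[model response to {len(prompt)} characters of prompt]"

    print(self_augmented_answer(echo_llm, "Name | Dept\nAnn | R&D", "Who works in R&D?"))
```

Passing the model in as a callable keeps the sketch independent of any particular API client and makes the two-call structure easy to test with a stub.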