NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables

Lanrui Wang, Mingyu Zheng, Hongyin Tang, Zheng Lin, Yanan Cao, Jingang Wang, Xunliang Cai, Weiping Wang

TL;DR

This paper introduces NeedleInATable (NIAT), a new long-context benchmark for structured tables that treats each cell as a needle and requires models to locate or look up target cells. It reveals a substantial gap between performance on NIAT and on traditional downstream tabular tasks, suggesting that current models may rely on dataset-specific shortcuts rather than robust long-context table understanding. The authors present a strong2weak NIAT data synthesis approach that uses GPT-4o to generate 12K NIAT examples with chain-of-thought targets; fine-tuning on this data yields notable improvements on NIAT and downstream benchmarks (e.g., ≈3.8% NIAT and ≈14.6% downstream gains). Overall, NIAT provides a foundation for evaluating and strengthening the long-context comprehension of long structured tables in LLMs and multimodal LLMs, with practical implications for robust table reasoning.

Abstract

Processing structured tabular data, particularly large and lengthy tables, constitutes a fundamental yet challenging task for large language models (LLMs). However, existing long-context benchmarks like Needle-in-a-Haystack primarily focus on unstructured text, neglecting the challenge of diverse structured tables. Meanwhile, previous tabular benchmarks mainly consider downstream tasks that require high-level reasoning abilities, and overlook models' underlying fine-grained perception of individual table cells, which is crucial for practical and robust LLM-based table applications. To address this gap, we introduce NeedleInATable (NIAT), a new long-context tabular benchmark that treats each table cell as a "needle" and requires models to extract the target cell based on cell locations or lookup questions. Our comprehensive evaluation of various LLMs and multimodal LLMs reveals a substantial performance gap between popular downstream tabular tasks and the simpler NIAT task, suggesting that these models may rely on dataset-specific correlations or shortcuts to obtain better benchmark results but lack truly robust long-context understanding of structured tables. Furthermore, we demonstrate that using synthesized NIAT training data can effectively improve performance on both the NIAT task and downstream tabular tasks, which validates the importance of NIAT capability for LLMs' genuine table understanding ability.
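The two query types named in the abstract (extracting a cell by its location, and extracting it via a lookup question) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the question wording, the 1-indexed positions that exclude the header row, and the Markdown serialization are hypothetical and are not taken from the paper's actual construction pipeline.

```python
from typing import List, Tuple


def to_markdown(header: List[str], rows: List[List[str]]) -> str:
    """Serialize a table as Markdown text (one possible input format)."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


def cell_locating_example(
    rows: List[List[str]], row_idx: int, col_idx: int
) -> Tuple[str, str]:
    """Build a (question, answer) pair that targets a cell by position.

    Assumption: positions are 1-indexed and count data rows only.
    """
    question = (
        f"What is the value of the cell in row {row_idx}, "
        f"column {col_idx} (header row excluded)?"
    )
    return question, rows[row_idx - 1][col_idx - 1]


def cell_lookup_example(
    header: List[str],
    rows: List[List[str]],
    key_col: str,
    key_value: str,
    target_col: str,
) -> Tuple[str, str]:
    """Build a (question, answer) pair that targets a cell by content."""
    k, t = header.index(key_col), header.index(target_col)
    matches = [row for row in rows if row[k] == key_value]
    if len(matches) != 1:
        # Require a unique key so the "needle" has exactly one answer.
        raise ValueError("key_value must identify exactly one row")
    question = f"What is the {target_col} of the row whose {key_col} is {key_value}?"
    return question, matches[0][t]


if __name__ == "__main__":
    header = ["Name", "Year", "Score"]
    rows = [["Ann", "2019", "81"], ["Bob", "2020", "77"], ["Cy", "2021", "93"]]
    print(to_markdown(header, rows))
    print(cell_locating_example(rows, 2, 3))
    print(cell_lookup_example(header, rows, "Name", "Cy", "Year"))
```

In this framing every cell can serve as a target, so a single table yields as many test queries as it has cells.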

Paper Structure

This paper contains 37 sections, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Comparison of previous long-context benchmarks, tabular benchmarks and the proposed NIAT benchmark. Existing long-context benchmarks overlook structured tabular data, while traditional tabular benchmarks mainly focus on high-level complex reasoning ability. Both ignore the model’s basic fine-grained comprehension of individual cells in the table context.
  • Figure 2: The construction pipeline of Needle-in-a-Table Benchmark.
  • Figure 3: The illustration of LLMs' attention patterns for structured tables. The input tables are in Markdown format and '(m,n)' indicates cell tokens in the m-th row and n-th column.
  • Figure 4: The per-cell accuracy heat map of the cell-locating task on tables of fixed sizes, with redder cells indicating lower accuracy and greener cells indicating higher accuracy. The x-axis represents tables of different sizes (e.g., 8 × 8 denotes tables with 8 rows and 8 columns), while the y-axis lists the evaluated LLMs in their instruct versions. + ours denotes models fine-tuned with our synthesized NIAT training data.
  • Figure 5: The distribution of input lengths for our proposed NIAT benchmark. The tokenizer of Llama3.1-8B-Instruct is adopted to calculate the token length.
  • ...and 8 more figures
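
The per-cell accuracy grid described in the Figure 4 caption can be sketched as a simple aggregation over individual cell-locating results. This is only an illustrative assumption about how such a grid might be computed: the function name, the 0-indexed `(row, col, is_correct)` record format, and the use of NaN for cells that were never queried are all hypothetical, not the authors' evaluation code.

```python
from typing import Iterable, List, Tuple


def per_cell_accuracy(
    n_rows: int,
    n_cols: int,
    results: Iterable[Tuple[int, int, bool]],
) -> List[List[float]]:
    """Aggregate (row, col, is_correct) records into an accuracy grid.

    Assumption: row and col are 0-indexed positions in a table of fixed
    size n_rows x n_cols. Cells with no records are reported as NaN.
    """
    hits = [[0] * n_cols for _ in range(n_rows)]
    totals = [[0] * n_cols for _ in range(n_rows)]
    for row, col, is_correct in results:
        totals[row][col] += 1
        hits[row][col] += int(is_correct)
    return [
        [
            hits[r][c] / totals[r][c] if totals[r][c] else float("nan")
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]


if __name__ == "__main__":
    records = [(0, 0, True), (0, 0, False), (1, 1, True)]
    for line in per_cell_accuracy(2, 2, records):
        print(line)
```

The resulting grid can be passed to any plotting library with a red-to-green colormap to produce a heat map of the kind the caption describes.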