Table as a Modality for Large Language Models

Liyao Li, Chao Ye, Wentao Ye, Yifei Sun, Zhe Jiang, Haobo Wang, Jiaming Tian, Yiming Zhang, Ningtao Wang, Xing Fu, Gang Chen, Junbo Zhao

TL;DR

The paper tackles the difficulty of tabular reasoning with LLMs due to loss of structural information when tables are serialized as text. It introduces TaMo, a multimodal framework that encodes table structure with a hypergraph transformer and integrates it with LLMs via a learnable interface, achieving substantial generalization gains. A new StructQA benchmark assesses table-structure understanding and permutation invariance, showing current text-only approaches underperform. Across five table-reasoning datasets, TaMo delivers large improvements over text baselines and competitive performance with specialist SOTA methods, illustrating the value of explicitly modeling table structure as a separate modality for robust tabular QA.

Abstract

To extend the remarkable successes of Large Language Models (LLMs), the community has made numerous efforts to generalize them to table reasoning tasks over widely deployed tabular data. Despite these efforts, a probing experiment on our proposed StructQA benchmark leads us to postulate that even the most advanced LLMs (such as GPTs) may still fall short in coping with tabular data. More specifically, the current scheme often simply serializes the tabular data, together with the meta information, and feeds the result to the LLM. We argue that the loss of structural information is the root of this shortcoming. In this work, we further propose TAMO, which treats tables as an independent modality integrated with the text tokens. The resulting model is a multimodal framework consisting of a hypergraph neural network as the global table encoder, seamlessly integrated with a mainstream LLM. Empirical results on various benchmark datasets, including HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, demonstrate significant improvements in generalization, with an average relative gain of 42.65%.
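
The structural loss described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the pipe-delimited format, the toy medal table, and the helper names (`serialize`, `permute_rows`, `cell_set`) are all assumptions made for illustration. Reordering rows leaves the table's content unchanged, yet the serialized text that a text-only LLM receives is different.

```python
def serialize(header, rows):
    """Flatten a table into a pipe-delimited text sequence (assumed format)."""
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in row) for row in rows]
    return "\n".join(lines)


def permute_rows(rows, order):
    """Return the rows rearranged according to a list of row indices."""
    return [rows[i] for i in order]


def cell_set(header, rows):
    """Order-independent view of the table: (row key, column, value) triples.

    Assumes the first column uniquely identifies each row.
    """
    return {(row[0], col, val) for row in rows for col, val in zip(header, row)}


# Toy data, invented for this example only.
header = ["Country", "Gold", "Silver"]
rows = [["Canada", 3, 1], ["Norway", 5, 2], ["Japan", 2, 4]]
permuted = permute_rows(rows, [2, 0, 1])

same_content = cell_set(header, rows) == cell_set(header, permuted)
same_text = serialize(header, rows) == serialize(header, permuted)
```

Here `same_content` is true while `same_text` is false: the model input changes even though the table does not, which is the kind of sensitivity a permutation-invariance probe is meant to expose.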

Paper Structure

This paper contains 36 sections, 5 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Current tabular LLMs oversimplify tables into text sequences, ignoring structural information and hindering even basic table cell localization tasks. This work is the first to integrate table structure directly into LLMs.
  • Figure 2: We evaluate LLMs' understanding of table structure on our StructQA dataset (see the StructQA section). The evaluation focuses on permutation invariance, assessing answer robustness across randomly permuted rows and columns. Results show that TaMo achieves state-of-the-art performance, comparable with GPT-4.
  • Figure 3: An example of converting arbitrary simple or complex tables into hypergraphs. A simple flat table is a special case of the complex hierarchical table. A hyperedge (e.g., table headers) in the hypergraph is a set of regular nodes. We construct the corresponding hypergraph format according to the hierarchical relationships of the table.
  • Figure 4: The proposed framework for tabular LLMs, TaMo. Given a table input, the hypergraph-enhanced table encoder (see the tabular encoder section) captures the structural properties unique to the table modality. In parallel, we serialize the original table into a formatted text sequence. Finally, we feed both the table-structure embeddings and the textual embeddings into the LLM, which generates answers under the next-token prediction paradigm. LoRA is optional.
  • Figure 5: A real case from the WikiSQL dataset, visualizing the attention weights from other input tokens to the answer cell "Canada". Intuitively, the darker the color, the more closely the token is associated with "Canada". We observe that with TaMo's "[table_structure_token]", the LLM focuses better on information relevant to the correct answer, as indicated by the darker background colors on those tokens.
  • ...and 3 more figures
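
The table-to-hypergraph conversion described in the Figure 3 caption can be sketched for the flat-table case. This is a rough illustration under stated assumptions rather than the paper's construction: cells are taken as nodes, each column header and each row is taken as a hyperedge, column names are assumed unique, and the function name `table_to_hypergraph` and the toy table are hypothetical.

```python
def table_to_hypergraph(header, rows):
    """Build a hypergraph from a flat table.

    Nodes are cells, identified by (row index, column index).
    Hyperedges are sets of node ids: one per column header and one per row.
    Assumes column names in `header` are unique.
    """
    nodes = {}  # node id -> cell value
    hyperedges = {("col", name): set() for name in header}
    for r, row in enumerate(rows):
        hyperedges[("row", r)] = set()
        for c, value in enumerate(row):
            node_id = (r, c)
            nodes[node_id] = value
            hyperedges[("col", header[c])].add(node_id)
            hyperedges[("row", r)].add(node_id)
    return nodes, hyperedges


# Toy data, invented for this example only.
header = ["Country", "Gold"]
rows = [["Canada", 3], ["Norway", 5], ["Japan", 2]]
nodes, hyperedges = table_to_hypergraph(header, rows)

# Each cell belongs to exactly one column hyperedge and one row hyperedge.
memberships = {
    node_id: sum(node_id in members for members in hyperedges.values())
    for node_id in nodes
}
```

Because a hyperedge is a set, membership does not depend on the order in which rows or columns are listed. A hierarchical table could plausibly be handled by adding a hyperedge for each parent header over the cells of its child headers; that extension is an assumption and is not shown here.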