Tabular Representation, Noisy Operators, and Impacts on Table Structure Understanding Tasks in LLMs

Ananya Singha; José Cambronero; Sumit Gulwani; Vu Le; Chris Parnin

TL;DR

The paper systematically analyzes how tabular representation formats and eight real-world-inspired noise operations affect LLMs' ability to perform self-supervised table-structure tasks via in-context learning. It introduces a broad evaluation framework across eight formats and eight noise types, assessing both fact-finding and transformation tasks on seven Kaggle datasets using GPT-3. Key findings show that the DFLoader and JSON formats typically yield the best performance for most tasks, while noise can either improve or degrade results depending on the task-format combination. The work highlights format brittleness in LLMs and motivates further multi-LLM studies and investigations into how table-structure robustness translates to downstream table tasks. Overall, the paper provides practical guidance for prompt design and data preparation when working with tabular data in LLM applications.

Abstract

Large language models (LLMs) are increasingly applied to tabular tasks using in-context learning. The prompt representation for a table may play a role in the LLM's ability to process the table. Inspired by prior work, we generate a collection of self-supervised structural tasks (e.g. navigate to a cell and row; transpose the table) and evaluate the performance differences when using 8 formats. In contrast to past work, we introduce 8 noise operations inspired by real-world messy data and adversarial inputs, and show that such operations can impact LLM performance across formats for different structural understanding tasks.
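To make the setup in the abstract concrete, here is a minimal sketch of how one table can be serialized into several prompt formats and how a self-supervised navigation task can be derived from it. The toy table, the function names (`to_json`, `to_csv`, `to_markdown`, `navigation_task`), and the prompt wording are assumptions for illustration only; they are not taken from the paper, and only three of the eight formats are shown.

```python
import csv
import io
import json

# Hypothetical toy table; the paper's tables come from Kaggle datasets.
HEADER = ["name", "city", "age"]
ROWS = [["Ada", "London", 36], ["Linus", "Helsinki", 28], ["Grace", "New York", 45]]


def to_json(header, rows):
    """Serialize the table as a JSON list of records."""
    return json.dumps([dict(zip(header, row)) for row in rows])


def to_csv(header, rows):
    """Serialize the table as comma-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def to_markdown(header, rows):
    """Serialize the table as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(v) for v in row) + " |" for row in rows]
    return "\n".join(lines)


def navigation_task(header, rows, row_idx, col_name, serializer):
    """Build a (prompt, expected_answer) pair.

    The task is self-supervised: the expected answer is read directly
    from the table, so no human labeling is needed.
    """
    prompt = (
        f"{serializer(header, rows)}\n\n"
        f"What is the value in column '{col_name}' of row {row_idx} (0-indexed)?"
    )
    answer = str(rows[row_idx][header.index(col_name)])
    return prompt, answer


# Same question, same expected answer, different table representation.
prompt, answer = navigation_task(HEADER, ROWS, 1, "city", to_markdown)
```

Swapping `to_markdown` for `to_json` or `to_csv` changes only the table's representation in the prompt, which is the variable the format comparison isolates.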

Paper Structure (13 sections, 3 figures, 4 tables)

Introduction
Related Work
Methodology
Table and Formats
Noise Operations
Self-Supervised Structural Tasks
Experimental Setup
Experimental Setup
Experimental Setup
Results
RQ1: Impact of Formats on LLMs performance on Different Task
RQ2: Impact of Noise Operations on LLM's Performance on Structural Tasks.
Conclusion

Figures (3)

Figure 1: Our evaluation considers 8 different table representation formats that are popular in the data science domain.
Figure 2: We apply eight different noise operations to test for the influence of spatial invariance, header-row information, and the presence of semi-structured content on structural table task performance.
Figure 3: We generate self-supervised structural table understanding tasks: fact-finding tasks (e.g. navigation) and transformation tasks (e.g. table transposition).
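
The noise operations and the transposition task named in the captions above can be sketched as simple table transforms. This is a minimal illustration, assuming a table stored as a header list plus a list of rows; the function names and the exact behavior of each operator are hypothetical and may differ from the paper's definitions.

```python
import random


def shuffle_rows(header, rows, seed=0):
    """Spatial-invariance noise: reorder rows, leave the header untouched."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    return header, shuffled


def shuffle_columns(header, rows, seed=0):
    """Spatial-invariance noise: reorder columns, keeping each value under its header."""
    rng = random.Random(seed)
    order = list(range(len(header)))
    rng.shuffle(order)
    new_header = [header[i] for i in order]
    new_rows = [[row[i] for i in order] for row in rows]
    return new_header, new_rows


def sequential_column_names(header, rows):
    """Header noise: replace meaningful column names with positional ones."""
    return [f"col_{i}" for i in range(len(header))], rows


def transpose(header, rows):
    """Ground truth for a transposition task: each column becomes one row."""
    return [list(col) for col in zip(header, *rows)]
```

Because every transform is deterministic given a seed, the expected answer for a task on the noisy table can be recomputed from the transformed table itself.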
