Large Language Models (LLMs) on Tabular Data: Prediction, Generation, and Understanding -- A Survey

Xi Fang, Weijie Xu, Fiona Anting Tan, Jiani Zhang, Ziqing Hu, Yanjun Qi, Scott Nickleach, Diego Socolinsky, Srinivasan Sengamedu, Christos Faloutsos

TL;DR

This paper surveys how large language models can be applied to tabular data across prediction, generation, and understanding tasks. It provides a taxonomy of techniques, datasets, metrics, and methodologies, spanning serialization, table manipulation, prompting, and end-to-end systems, with cross-task insights. It documents a broad landscape of methods for tabular prediction, data synthesis, and table QA, and discusses practical limitations such as hallucination, bias, and interpretability while offering directions for standardized benchmarks and tokenizer improvements. The work aims to guide researchers and practitioners in selecting approaches and datasets, and to illuminate promising avenues for future research in tabular data modeling with LLMs.

Abstract

Recent breakthroughs in large language modeling have facilitated rigorous exploration of its application to diverse tasks in tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently no comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides references to relevant code and datasets. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.

Paper Structure (53 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Overview of LLMs on tabular data: the paper discusses the application of LLMs to prediction, data generation, and table understanding tasks.
  • Figure 2: Tabular data characteristics and machine learning models for tabular data prediction, data synthesis and table understanding like question answering before LLMs.
  • Figure 3: Development of language models and their applications in tabular data modeling.
  • Figure 4: Key techniques in using LLMs for tabular data. The dotted line indicates steps that are optional.
  • Figure 5: The data generation process for causal LMs
  • ...and 1 more figures