Language Modeling on Tabular Data: A Survey of Foundations, Techniques and Evolution

Yucheng Ruan, Xiang Lan, Jingying Ma, Yizhi Dong, Kai He, Mengling Feng

TL;DR

Tabular data pose diverse heterogeneities and complex relationships that challenge standard language models. The paper surveys foundations, data types, downstream tasks, and modeling techniques for tabular data, and traces the evolution from table-specific pre-training to the era of large language models. It provides a unified taxonomy of 1D and 2D data, catalogs representative datasets, and reviews challenges and future directions. The findings highlight that LLMs enable few-shot and zero-shot tabular reasoning and offer a roadmap for scalable, interpretable, and fair tabular AI systems.

Abstract

Tabular data, a prevalent data type across various domains, presents unique challenges due to its heterogeneous nature and complex structural relationships. Achieving high predictive performance and robustness in tabular data analysis holds significant promise for numerous applications. Influenced by recent advancements in natural language processing, particularly transformer architectures, new methods for tabular data modeling have emerged. Early techniques concentrated on pre-training transformers from scratch, often encountering scalability issues. Subsequently, methods leveraging pre-trained language models like BERT have been developed, which require less data and yield enhanced performance. The recent advent of large language models, such as GPT and LLaMA, has further revolutionized the field, facilitating more advanced and diverse applications with minimal fine-tuning. Despite the growing interest, a comprehensive survey of language modeling techniques for tabular data remains absent. This paper fills this gap by providing a systematic review of the development of language modeling for tabular data, encompassing: (1) a categorization of different tabular data structures and data types; (2) a review of key datasets used in model training and tasks used for evaluation; (3) a summary of modeling techniques including widely adopted data processing methods, popular architectures, and training objectives; (4) the evolution from adapting traditional pre-training/pre-trained language models to the utilization of large language models; (5) an identification of persistent challenges and potential future research directions in language modeling for tabular data analysis. The GitHub page associated with this survey is available at: https://github.com/lanxiang1017/Language-Modeling-on-Tabular-Data-Survey.git.

Paper Structure (41 sections, 7 figures, 4 tables)

Figures (7)

  • Figure 1: The timeline of the evolution of language modeling on tabular data. Each model includes the following information from top to bottom: type of tabular data (1D or 2D), evaluation tasks, and the backbone model.
  • Figure 2: The structure of the survey. It includes three main parts: foundations in tabular data, tabular data modeling techniques, and the evolution of language modeling on tabular data.
  • Figure 3: Illustration of 1D tabular data (left) and 2D tabular data (right). One row represents a sample in 1D tabular data, while one entire table corresponds to a sample in 2D tabular data.
  • Figure 4: The taxonomy of input processing. It contains data retrieval, table serialization and content integration.
  • Figure 5: Illustration of the flattened sequence for 1D (bottom) and 2D (top) data in table serialization.
  • ...and 2 more figures
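
The distinction between 1D and 2D samples (Figure 3) and the flattened sequences used in table serialization (Figure 5) can be sketched roughly as follows. This is only an illustrative sketch: the function names, the "column is value" template, the " | " cell separator, and the "[ROW]" marker are all assumptions made here, not formats taken from the survey, and the example data is hypothetical.

```python
from typing import Dict, List, Sequence


def serialize_row(row: Dict[str, object]) -> str:
    """Flatten one 1D sample (a single row) into a text sequence.

    Assumed template: "<column> is <value>", joined by commas.
    """
    return ", ".join(f"{col} is {val}" for col, val in row.items())


def serialize_table(header: Sequence[str], rows: List[Sequence[object]]) -> str:
    """Flatten one 2D sample (a whole table) into a text sequence.

    Assumed layout: header first, then each row, with cells joined by
    " | " and rows joined by a hypothetical "[ROW]" marker.
    """
    lines = [" | ".join(header)]
    lines += [" | ".join(str(cell) for cell in r) for r in rows]
    return " [ROW] ".join(lines)


if __name__ == "__main__":
    # Hypothetical 1D sample: one row is one training example.
    print(serialize_row({"age": 42, "job": "nurse", "label": "yes"}))

    # Hypothetical 2D sample: the whole table is one training example.
    print(serialize_table(["name", "score"], [["a", 1], ["b", 2]]))
```

In the 1D case each serialized row would typically be passed to the model as its own input, whereas in the 2D case the entire flattened table forms a single input, often alongside a question or caption.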