Unleashing the Potential of Large Language Models for Predictive Tabular Tasks in Data Science

Yazheng Yang, Yuqi Wang, Yaxuan Li, Sankalok Sen, Lei Li, Qi Liu

TL;DR

This work demonstrates that large language models can be effectively adapted to predictive tabular tasks by pretraining an LLM on a large, table-focused corpus and aligning it with task instructions through unified Markdown serialization and a Mask-Then-Predict objective, followed by downstream multi-task fine-tuning. The approach yields substantial improvements over strong baselines across classification, regression, and missing-value imputation, and showcases strong zero-shot, few-shot, and extremely long-context capabilities. The results establish a new benchmark for tabular intelligence and highlight the practical potential of tailored LLM pretraining for data science workflows.

Abstract

In data science, classification, regression, and imputation of missing values are commonly encountered predictive tasks on tabular data. This research applies Large Language Models (LLMs) to these predictive tasks. Despite their proficiency in comprehending natural language, LLMs fall short in dealing with structured tabular data. This limitation stems from their lack of exposure to the intricacies of tabular data during their foundational training. Our research aims to close this gap by compiling a comprehensive corpus of tables annotated with instructions and executing large-scale training of Llama-2 on this enriched dataset. Furthermore, we investigate applying the trained model to zero-shot prediction, few-shot prediction, and in-context learning scenarios. Through extensive experiments, our methodology has shown significant improvements over existing benchmarks. These advancements highlight the efficacy of tailoring LLM training to solve table-related problems in data science, thereby establishing a new benchmark in the utilization of LLMs for enhancing tabular intelligence.

Paper Structure

This paper contains 15 sections, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Illustration of our methodology for the training of Large Language Models (LLMs) with tables and the subsequent application of our model to downstream tasks.
  • Figure 2: Illustration of the initial pretraining phase of an LLM applying the Mask-Then-Predict strategy (on the left), followed by the multi-task training phase customized for downstream tasks such as classification and regression (on the right). Through the former phase, the LLM acquires unstructured knowledge embedded within tables. Subsequently, during the latter phase, it enhances its capability for reasoning between instructions and tabular contents.
  • Figure 3: The unified prompt template used for combining the instruction with tables to form the model input in both pretraining and finetuning in downstream tasks.
  • Figure 4: The domain distribution: the percentages of the top-32 domains of tables collected from Kaggle. The tables that we collect cover around 300 domains.
  • Figure 5: The data type distribution: the percentages of numerical columns and textual columns in our collected Kaggle tables.
  • ...and 9 more figures
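
The captions for Figures 2 and 3 describe serializing a table as Markdown, combining it with an instruction, and masking cells for the model to predict. The sketch below illustrates that general idea only. It does not reproduce the paper's actual prompt template, and the names `MASK_TOKEN`, `to_markdown`, `mask_cells`, and `build_example`, as well as the instruction wording and the mask ratio, are all assumptions made for illustration.

```python
import random

# Hypothetical placeholder; the sentinel token used in the paper may differ.
MASK_TOKEN = "<MASK>"


def to_markdown(header, rows):
    """Serialize a table (header plus rows) as a Markdown pipe table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(cell) for cell in row) + " |")
    return "\n".join(lines)


def mask_cells(rows, mask_ratio, rng):
    """Replace a random subset of cells with MASK_TOKEN.

    Returns the masked copy of the rows and a dict mapping
    (row_index, column_index) to the original cell value.
    """
    positions = [(r, c) for r in range(len(rows)) for c in range(len(rows[r]))]
    k = max(1, round(mask_ratio * len(positions)))
    chosen = rng.sample(positions, k)
    masked = [list(row) for row in rows]
    targets = {}
    for r, c in chosen:
        targets[(r, c)] = rows[r][c]
        masked[r][c] = MASK_TOKEN
    return masked, targets


def build_example(instruction, header, rows, mask_ratio=0.25, seed=0):
    """Combine an instruction with a masked Markdown table (assumed layout)."""
    rng = random.Random(seed)
    masked, targets = mask_cells(rows, mask_ratio, rng)
    prompt = instruction + "\n\n" + to_markdown(header, masked)
    return prompt, targets


if __name__ == "__main__":
    header = ["age", "city", "income"]
    rows = [[31, "Oslo", 52000], [45, "Lima", 61000]]
    prompt, targets = build_example("Fill in the masked cells.", header, rows)
    print(prompt)
    print(targets)
```

Under these assumptions, the returned `targets` would serve as the labels for the masked positions during pretraining, while a downstream classification or regression example could reuse the same serialization with only the target column masked.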