AnnotatedTables: A Large Tabular Dataset with Language Model Annotations

Yaojie Hu, Ilias Fountalis, Jin Tian, Nikolaos Vasiloglou

TL;DR

This work tackles the annotation bottleneck in tabular machine learning by using large language models (LLMs) to synthesize executable SQL annotations and input-target column annotations at scale. It introduces AnnotatedTables, the largest SQL dataset with associated tabular data that supports query execution (32,119 databases and 405,616 valid SQL programs), and presents two validation studies: translating SQL to Rel, a database language previously unknown to LLMs, via Incremental Prompt Engineering, and evaluating TabPFN on a broad, real-world suite of tables with input-target columns annotated by LLMs. The results show substantial but imperfect annotation quality (82.25% valid SQL, ~40% SQL-to-Rel execution accuracy), with clear insights into where LLMs excel and where they fall short. Overall, the work demonstrates that LLM-driven annotation can dramatically reduce the human labor needed to build large, diverse tabular datasets, and offers a flexible framework for targeted research in database language translation and tabular classification.

Abstract

Tabular data is ubiquitous in real-world applications and abundant on the web, yet its annotation has traditionally required human labor, posing a significant scalability bottleneck for tabular machine learning. We propose to automate this annotation with large language models (LLMs). Our methodology can successfully annotate a large amount of tabular data and can be flexibly steered to generate various types of annotations based on specific research objectives, as we demonstrate with SQL annotation and input-target column annotation as examples. As a result, we release AnnotatedTables, a collection of 32,119 databases with LLM-generated annotations. The dataset includes 405,616 valid SQL programs, making it the largest SQL dataset with associated tabular data that supports query execution. To further demonstrate the value of our methodology and dataset, we perform two follow-up research studies. 1) We investigate whether LLMs can translate SQL programs to Rel programs, a database language previously unknown to LLMs, while obtaining the same execution results. Using our Incremental Prompt Engineering methods based on execution feedback, we show that LLMs can produce adequate translations with few-shot learning. 2) We evaluate the performance of TabPFN, a recent neural tabular classifier trained on Bayesian priors, on 2,720 tables with input-target columns identified and annotated by LLMs. On average, TabPFN performs on par with the baseline AutoML method, though relative performance can vary significantly from one data table to another, making both models viable for practical applications depending on the situation. Our findings underscore the potential of LLMs in automating the annotation of large volumes of diverse tabular data.
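
Since only SQL programs that actually run against their tables count as valid annotations, the filtering step can be sketched as execution-based validation. This is a minimal sketch, not the paper's pipeline: the schema and candidate queries below are hypothetical stand-ins for a crawled table and LLM outputs, and an in-memory SQLite database stands in for the dataset's execution backend.

```python
import sqlite3

def validate_sql_annotations(create_stmts, candidate_queries):
    """Keep only the candidate queries that execute successfully.

    `create_stmts` builds a small database; `candidate_queries` are
    hypothetical LLM-generated SQL annotations to filter by execution.
    """
    conn = sqlite3.connect(":memory:")
    for stmt in create_stmts:
        conn.execute(stmt)
    valid = []
    for sql in candidate_queries:
        try:
            conn.execute(sql).fetchall()  # run the query to completion
            valid.append(sql)
        except sqlite3.Error:
            pass  # invalid annotation: discarded rather than repaired
    conn.close()
    return valid

schema = [
    "CREATE TABLE sales (region TEXT, amount REAL)",
    "INSERT INTO sales VALUES ('east', 10.0), ('west', 7.5)",
]
candidates = [
    "SELECT region, SUM(amount) FROM sales GROUP BY region",  # executes
    "SELECT amount FROM orders",                              # no such table
]
print(validate_sql_annotations(schema, candidates))  # keeps only the first query
```

Applying this filter per database is what yields a count like "405,616 valid SQL programs" out of a larger pool of raw LLM generations.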

Paper Structure

This paper contains 40 sections, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: An illustration of the SQL code annotation process with a large language model.
  • Figure 2: With Incremental Prompt Engineering, translation accuracy gradually improves and converges as more examples are added.
  • Figure 3: The AUROC (OVO) of TabPFN versus baseline AutoGluon with a 1-minute time budget on the tabular classification problems in AnnotatedTables.
  • Figure 4: Bar plots for the performance metrics of TabPFN and AutoGluon. The outliers are not plotted as they are far from the quantile bars.
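
The convergence behavior in Figure 2 comes from a feedback loop: translate each SQL program, check execution equivalence, and fold failing cases back into the few-shot prompt before retrying. The sketch below illustrates that loop shape only; `translate` and `execute_match` are toy stand-ins for the LLM call and the execution-result comparison, not the paper's actual prompts or Rel runtime.

```python
def incremental_prompt_engineering(translate, execute_match, cases, max_shots=5):
    """Grow the few-shot prompt from execution failures.

    Each round translates every case, records the accuracy, and adds one
    failing example (with its reference translation) to the prompt, until
    all cases pass or the shot budget is exhausted.
    """
    shots = []    # (sql, reference_rel) few-shot examples
    history = []  # per-round translation accuracy
    while True:
        failures = []
        for sql, reference_rel in cases:
            rel = translate(sql, shots)
            if not execute_match(sql, rel):
                failures.append((sql, reference_rel))
        history.append(1 - len(failures) / len(cases))
        if not failures or len(shots) >= max_shots:
            return history
        shots.append(failures[0])  # feed one failure back into the prompt

# Toy setup: the "correct translation" is just the uppercased query, and the
# mock model translates correctly once it has seen an example sharing the
# query's leading keyword.
cases = [("select a", "SELECT A"), ("select b", "SELECT B"), ("join c", "JOIN C")]

def translate(sql, shots):
    keyword = sql.split()[0]
    if any(s.split()[0] == keyword for s, _ in shots):
        return sql.upper()
    return sql  # untranslated guess

def execute_match(sql, rel):
    return rel == sql.upper()

print(incremental_prompt_engineering(translate, execute_match, cases))
# accuracy improves monotonically across rounds, as in Figure 2
```

The returned history mirrors the x-axis of Figure 2: accuracy rises as examples are added and flattens once the remaining failures no longer resemble anything in the prompt.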