TabLib: A Dataset of 627M Tables with Context

Gus Eggert, Kevin Huo, Mike Biven, Justin Waugh

TL;DR

TabLib presents a dataset of 627 million tables totaling 69 TiB, with 867B tokens of contextual information, sourced from GitHub and Common Crawl and spanning formats such as CSV, HTML, PDF, Excel, SQLite, and more. The authors implement a scalable, metadata-rich pipeline to extract, normalize, and store tabular data with provenance, enabling deduplication and analysis at scale. They demonstrate long-tail, Zipf-like distributions in table statistics, substantial data duplication, and diverse language and data-type profiles, while also examining ethics, biases, licensing, and limitations. The work argues that TabLib can catalyze progress in tabular data understanding and large tabular data model development, with practical implications for dataset construction, benchmarking, and pre-training of tabular AI systems.

Abstract

It is well-established that large, diverse datasets play a pivotal role in the performance of modern AI systems for text and image modalities. However, there are no datasets for tabular data of comparable size and diversity to those available for text and images. Thus we present "TabLib", a compilation of 627 million tables totaling 69 TiB, along with 867B tokens of context. TabLib was extracted from numerous file formats, including CSV, HTML, SQLite, PDF, Excel, and others, sourced from GitHub and Common Crawl. The size and diversity of TabLib offer considerable promise in the table modality, reminiscent of the original promise of foundational datasets for text and images, such as The Pile and LAION.

Paper Structure (43 sections, 7 figures, 4 tables)

This paper contains 43 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Architecture of table extraction pipeline.
  • Figure 2: Power law behavior of table statistics. The (a) row-count, (b) column-count, and (c) domain-size (column-level unique-count) distributions are approximately power-law, with the tail following the theoretical fit less closely. The solid line shows the empirical distribution and the dotted line shows the theoretical fit for the relevant alpha value.
  • Figure 3: Content Hash Duplication Frequencies By Source. Duplication based on content_hash shows a Zipf-like distribution of frequency versus rank for both GitHub and Common Crawl.
  • Figure 4: 2D histogram of content hash distinct values. The number of distinct context_metadata values varies widely among tables that share a duplicated content_hash, for both Common Crawl and GitHub. The y-axis is the count of distinct context_metadata values, and the x-axis is the total number of duplicates for a given content hash. Both axes are on a log scale with log-spaced bins, and the color reflects a normalized density.
  • Figure 5: Data Categories Breakdown by File Type and Data Source. CC is Common Crawl, and GH is GitHub. HTML accounts for the majority of content across most categories, and GitHub content falls predominantly in the category "Software and Technology". Note that the x-axis shows frequencies normalized by data source, and the categories on the y-axis are sorted by their normalized frequency on GitHub. The x-axis is broken to prevent the high proportion of "Software and Technology" for GitHub from dominating the figure.
  • ...and 2 more figures
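
The duplication statistics described in the captions for Figures 3 and 4 can be illustrated with a minimal sketch. It assumes each table is represented as a (content_hash, context_metadata) pair; the function name `duplication_profile`, the toy records, and the use of plain strings for both fields are hypothetical choices for illustration, not the authors' implementation or the dataset's actual schema.

```python
from collections import Counter, defaultdict


def duplication_profile(records):
    """Summarize duplication for (content_hash, context_metadata) pairs.

    Returns a list of (rank, content_hash, duplicate_count, distinct_contexts)
    tuples, where rank 1 is the most frequently duplicated hash.
    Assumption: context_metadata values are hashable (plain strings here).
    """
    freq = Counter()              # how often each content hash occurs
    contexts = defaultdict(set)   # distinct context values seen per hash
    for content_hash, context_metadata in records:
        freq[content_hash] += 1
        contexts[content_hash].add(context_metadata)

    # Sort by descending frequency; break ties by hash so output is deterministic.
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, h, count, len(contexts[h]))
        for rank, (h, count) in enumerate(ranked, start=1)
    ]


# Hypothetical toy data: three distinct tables, some seen more than once.
toy_records = [
    ("h1", "repo-a"), ("h1", "repo-b"), ("h1", "repo-a"),
    ("h2", "page-x"), ("h2", "page-x"),
    ("h3", "page-y"),
]
profile = duplication_profile(toy_records)
for row in profile:
    print(row)
```

Plotting duplicate_count against rank on log-log axes would give a rank-frequency view in the spirit of Figure 3, and plotting distinct_contexts against duplicate_count would give the kind of comparison Figure 4 describes.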