TabLib: A Dataset of 627M Tables with Context
Gus Eggert, Kevin Huo, Mike Biven, Justin Waugh
TL;DR
TabLib presents a dataset of 627 million tables totaling 69 TiB, with 867B tokens of contextual information, sourced from GitHub and Common Crawl and spanning formats such as CSV, HTML, PDF, Excel, SQLite, and more. The authors implement a scalable, metadata-rich pipeline to extract, normalize, and store tabular data with provenance, enabling deduplication and analysis at scale. They demonstrate long-tail, Zipf-like distributions in table statistics, substantial data duplication, and diverse language and data-type profiles, while also examining ethics, biases, licensing, and limitations. The work argues that TabLib can catalyze progress in tabular data understanding and large tabular data model development, with practical implications for dataset construction, benchmarking, and pre-training of tabular AI systems.
Abstract
It is well established that large, diverse datasets play a pivotal role in the performance of modern AI systems for text and image modalities. However, there are no datasets for tabular data of comparable size and diversity to those available for text and images. Thus we present "TabLib", a compilation of 627 million tables totaling 69 TiB, along with 867B tokens of context. TabLib was extracted from numerous file formats, including CSV, HTML, SQLite, PDF, Excel, and others, sourced from GitHub and Common Crawl. The size and diversity of TabLib offer considerable promise in the table modality, reminiscent of the original promise of foundational datasets for text and images, such as The Pile and LAION.
