Towards Cross-Table Masked Pretraining for Web Data Mining

Chao Ye, Guoshan Lu, Haobo Wang, Liyao Li, Sai Wu, Gang Chen, Junbo Zhao

TL;DR

This work tackles the lack of scalable cross-table pretraining for web-derived tabular data by introducing CM2, a semantic-aware transformer-based framework with a novel prompt Masked Table Modeling objective. It is trained on a large, curated OpenTabs dataset to learn cross-table knowledge and uniform encodings of heterogeneous tables, enabling transfer to diverse downstream tasks. CM2 achieves state-of-the-art results on multiple tabular benchmarks, demonstrates strong few-shot generalization, and validates that cross-table pretraining can significantly improve tabular data mining on the web. The approach promises to provide scalable, shareable tabular representations applicable to a wide range of data mining tasks and supports future scaling toward a BERT-like moment for tabular data.

Abstract

Tabular data pervades the landscape of the World Wide Web, playing a foundational role in the digital architecture that underpins online information. Given the recent influence of large-scale pretrained models like ChatGPT and SAM across various domains, applying pretraining techniques to mining tabular data on the web has emerged as a highly promising research direction. There have been some recent works on this topic, but most (if not all) of them are limited to a fixed-schema or single-table setting. Given the dataset scale and parameter sizes of prior models, we believe that we have not yet reached the "BERT moment" for ubiquitous tabular data. Development along this line significantly lags behind counterpart research domains such as natural language processing. In this work, we first identify the crucial challenges behind tabular data pretraining, particularly overcoming the cross-table hurdle. As a pioneering endeavor, this work mainly (i) contributes a high-quality real-world tabular dataset, (ii) proposes an innovative, generic, and efficient cross-table pretraining framework, dubbed CM2, whose core is a semantic-aware tabular neural network that uniformly encodes heterogeneous tables without much restriction, and (iii) introduces a novel pretraining objective, prompt Masked Table Modeling (pMTM), inspired by NLP but intricately tailored to scalable pretraining on tables. Our extensive experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.

Paper Structure (32 sections, 11 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Compared to the mature pretraining techniques in CV or NLP, how to pretrain a universal tabular model for mining widely prevalent relational tables on the Web remains an underexplored area. We focus on solving it.
  • Figure 2: How the table's column schema is combined in past works (good for structured semantic understanding) versus in CM2 (better suited for tabular prediction).
  • Figure 3: Statistics of our OpenTabs dataset composition.
  • Figure 4: Overview of the proposed cross-table pretraining framework CM2. OpenTabs (see the dataset section) is the pretraining dataset contributed by us. First, we employ a feature encoder to uniformly process these heterogeneous tables and obtain feature embeddings. Then we mask some features and replace them with a shared, learnable vector. Finally, our pretraining objective aims at recovering these masked features based on the retained features and the schema information prompt. Best viewed in color.
  • Figure 5: Ablation studies on comparing CM2 with learning from scratch (w/o pretraining).
  • ...and 3 more figures
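
The masking step described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `mask_features`, the mask ratio of 0.5, and the fixed zero vector standing in for the shared learnable vector are all assumptions made for illustration. The feature encoder, the schema prompt, and the reconstruction of masked features are omitted.

```python
import random


def mask_features(embeddings, mask_vector, mask_ratio, rng):
    """Replace a random subset of feature embeddings with one shared mask vector.

    embeddings: list of per-column embedding vectors for a single table row.
    Returns (masked_embeddings, masked_indices).
    Hypothetical helper for illustration only; assumes at least two features.
    """
    n = len(embeddings)
    # Mask at least one feature, but always retain at least one as context.
    k = min(max(1, round(n * mask_ratio)), n - 1)
    masked_idx = sorted(rng.sample(range(n), k))
    masked = [
        list(mask_vector) if i in masked_idx else list(e)
        for i, e in enumerate(embeddings)
    ]
    return masked, masked_idx


rng = random.Random(0)
# Four feature embeddings of dimension 2 (toy values, not real encoder output).
row = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
# Stand-in for the shared learnable vector; in training it would be a parameter.
mask_vec = [0.0, 0.0]
masked_row, idx = mask_features(row, mask_vec, mask_ratio=0.5, rng=rng)
```

A pretraining objective in this style would then ask the model to recover the original values at the positions in `idx` from the retained embeddings plus the schema information of the masked columns.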