Towards Cross-Table Masked Pretraining for Web Data Mining

Chao Ye; Guoshan Lu; Haobo Wang; Liyao Li; Sai Wu; Gang Chen; Junbo Zhao

TL;DR

This work tackles the lack of scalable cross-table pretraining for web-derived tabular data by introducing CM2, a semantic-aware transformer-based framework with a novel prompt Masked Table Modeling objective. It is trained on a large, curated OpenTabs dataset to learn cross-table knowledge and uniform encodings of heterogeneous tables, enabling transfer to diverse downstream tasks. CM2 achieves state-of-the-art results on multiple tabular benchmarks, demonstrates strong few-shot generalization, and validates that cross-table pretraining can significantly improve tabular data mining on the web. The approach promises to provide scalable, shareable tabular representations applicable to a wide range of data mining tasks and supports future scaling toward a BERT-like moment for tabular data.

Abstract

Tabular data pervades the World Wide Web and plays a foundational role in the digital architecture that underpins online information. Given the recent influence of large-scale pretrained models such as ChatGPT and SAM across various domains, applying pretraining techniques to mining tabular data on the web has emerged as a highly promising research direction. There have been some recent works on this topic, but most (if not all) of them are limited to a fixed-schema or single-table setting. Given the limited dataset scale and parameter size of prior models, we believe that we have not reached the "BERT moment" for ubiquitous tabular data, and development on this line significantly lags behind counterpart research domains such as natural language processing. In this work, we first identify the crucial challenges behind tabular data pretraining, particularly overcoming the cross-table hurdle. As a pioneering endeavor, this work mainly (i) contributes a high-quality real-world tabular dataset, (ii) proposes an innovative, generic, and efficient cross-table pretraining framework, dubbed CM2, whose core is a semantic-aware tabular neural network that uniformly encodes heterogeneous tables without much restriction, and (iii) introduces a novel pretraining objective, prompt Masked Table Modeling (pMTM), inspired by NLP but intricately tailored to scalable pretraining on tables. Our extensive experiments demonstrate CM2's state-of-the-art performance and validate that cross-table pretraining can enhance various downstream tasks.

Paper Structure (32 sections, 11 equations, 8 figures, 5 tables)

Introduction
Challenges
Our Solution
Related Works
OpenTabs: A Large-Scale Tabular Dataset From Web
Methods
Task Formulation
Semantic-aware Tabular Neural Network
Feature Encoder for Heterogeneous Tables
Feature Interaction
Cross-table Pretraining Objective
Fine-Tuning on Downstream Tabular Tasks
Experiments
Experimental Setup
Datasets
...and 17 more sections

Figures (8)

Figure 1: Compared to the mature pretraining techniques in CV or NLP, how to pretrain a universal tabular model for mining widely prevalent relational tables on the Web remains an underexplored area. We focus on solving it.
Figure 2: Differences in how past works (good for structured semantic understanding) and CM2 (better suited for tabular prediction) combine a table's column schema.
Figure 3: Statistics of our OpenTabs dataset composition.
Figure 4: Overview of the proposed cross-table pretraining framework CM2. OpenTabs (see the section "OpenTabs: A Large-Scale Tabular Dataset From Web") is the pretraining dataset contributed by us. First, we employ a feature encoder to uniformly process these heterogeneous tables and obtain feature embeddings. Then we mask some features and replace them with a shared, learnable vector. Finally, our pretraining objective aims to recover these masked features from the retained features and a schema-information prompt. Best viewed in color.
Figure 5: Ablation studies on comparing CM2 with learning from scratch (w/o pretraining).
...and 3 more figures
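
The masking step described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names `mask_features` and `masked_mse`, the 0.5 mask ratio, the fixed zero mask vector (learnable in a real model), and the mean-squared-error reconstruction loss are all assumptions made for illustration, and the feature encoder, transformer, and schema prompt are omitted entirely.

```python
import random


def mask_features(embeddings, mask_vector, mask_ratio, rng):
    """Replace a random subset of feature embeddings with one shared mask vector.

    Returns the corrupted embeddings and the sorted indices that were masked.
    Expects at least two features so that some context is always retained.
    """
    n = len(embeddings)
    # Mask at least one feature, but always keep at least one as context.
    n_masked = min(max(1, round(mask_ratio * n)), n - 1)
    masked_idx = sorted(rng.sample(range(n), n_masked))
    masked = set(masked_idx)
    corrupted = [
        list(mask_vector) if i in masked else list(e)
        for i, e in enumerate(embeddings)
    ]
    return corrupted, masked_idx


def masked_mse(predicted, target, masked_idx):
    """Mean squared error computed only over the masked feature positions."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Toy row with four feature embeddings of dimension 2 (hypothetical values).
rng = random.Random(0)
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0]]
mask_vector = [0.0, 0.0]  # assumption: a learnable parameter in a real model

corrupted, masked_idx = mask_features(embeddings, mask_vector, 0.5, rng)

# Stand-in "model" that returns its input unchanged, so the loss measures
# how far the mask vector is from the original embeddings.
loss = masked_mse(corrupted, embeddings, masked_idx)
```

Restricting the loss to masked positions means the model is scored only on features it could not see, which is the usual design choice in masked modeling objectives; whether CM2 uses exactly this loss is not stated in the text above.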
