TABBIE: Pretrained Representations of Tabular Data

Hiroshi Iida, Dung Thai, Varun Manjunatha, Mohit Iyyer

TL;DR

This work targets tables that have no associated text by introducing TABBIE, a pretraining method that learns from tabular data alone. TABBIE uses two Transformers to encode rows and columns separately and adopts an ELECTRA-style corrupt-cell detection objective, which lets it learn embeddings for cells, rows, and columns efficiently. It achieves state-of-the-art or competitive results on column population, row population, and column type prediction while requiring far less compute than prior table-text models. Qualitative analyses show that TABBIE captures complex table semantics and numeric trends, and the authors release pretrained models and code to advance table-based representation learning.

Abstract

Existing work on tabular representation learning jointly models tables and associated text using self-supervised objective functions derived from pretrained language models such as BERT. While this joint pretraining improves tasks involving paired tables and text (e.g., answering questions about tables), we show that it underperforms on tasks that operate over tables without any associated text (e.g., populating missing cells). We devise a simple pretraining objective (corrupt cell detection) that learns exclusively from tabular data and reaches the state of the art on a suite of table-based prediction tasks. Unlike competing approaches, our model (TABBIE) provides embeddings of all table substructures (cells, rows, and columns), and it also requires far less compute to train. A qualitative analysis of our model's learned cell, column, and row representations shows that it understands complex table semantics and numerical trends.
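
To make the corrupt cell detection objective concrete, the following is a minimal sketch of how training examples for such an objective could be built: some cells are swapped for other values, and each cell gets a binary label saying whether it was replaced. The function name `corrupt_cells`, the `donor_cells` pool, and the default `corrupt_prob` are illustrative assumptions, not details taken from the paper, and sampling from a flat donor pool is only a stand-in for whatever corruption strategies the authors actually use.

```python
import random


def corrupt_cells(table, donor_cells, corrupt_prob=0.15, seed=0):
    """Return (corrupted_table, labels) for a corrupt-cell detection task.

    table        : list of rows, each a list of cell strings
    donor_cells  : hypothetical pool of replacement cell values
    corrupt_prob : illustrative corruption rate (an assumption)
    labels[r][c] is 1 if the cell was replaced, else 0.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for row in table:
        new_row, row_labels = [], []
        for cell in row:
            # Only values that differ from the original count as corruption.
            candidates = [c for c in donor_cells if c != cell]
            if candidates and rng.random() < corrupt_prob:
                new_row.append(rng.choice(candidates))
                row_labels.append(1)
            else:
                new_row.append(cell)
                row_labels.append(0)
        corrupted.append(new_row)
        labels.append(row_labels)
    return corrupted, labels


if __name__ == "__main__":
    table = [["Country", "Capital"], ["France", "Paris"], ["Japan", "Tokyo"]]
    donors = ["Berlin", "1998", "Gold", "Canada"]
    corrupted, labels = corrupt_cells(table, donors, corrupt_prob=0.3, seed=1)
    for row, row_labels in zip(corrupted, labels):
        print(row, row_labels)
```

A model trained on such pairs would predict, for every cell, the probability that it was corrupted, so the labels act as per-cell binary classification targets.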

Paper Structure

This paper contains 23 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: TABBIE is a table embedding model trained to detect corrupted cells, inspired by the ELECTRA (Clark et al., 2020) objective function. This simple pretraining objective results in powerful embeddings of cells, columns, and rows, and it yields state-of-the-art results on downstream table-based tasks.
  • Figure 2: TABBIE's computations at one layer. For a given table, the row Transformer contextualizes the representations of the cells in each row, while the column Transformer similarly contextualizes cells in each column. The final cell representation is an average of the row and column embeddings, which is passed as input to the next layer. [CLS] tokens are prepended to each row and column to facilitate downstream tasks operating on table substructures.
  • Figure 3: The different cell corruption strategies used in our experiments.
  • Figure 4: The inputs and outputs for each of our table-based prediction tasks. Column type prediction does not include headers as part of the table.
  • Figure 5: In this figure, (b) and (c) contain the predicted corruption probability of each cell in (a). Only TABBIE (MIX) is able to reliably identify violations of numerical trends in columns.
  • ...and 2 more figures
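
The per-layer data flow described in the Figure 2 caption (contextualize each row, contextualize each column, then average the two views of every cell) can be sketched as follows. This is a simplified illustration under stated assumptions: the helper `self_attention` is an unparameterised single-head dot-product attention standing in for a full Transformer, the function names are hypothetical, and the [CLS] tokens mentioned in the caption are omitted for brevity.

```python
import math


def self_attention(seq):
    """Unparameterised single-head dot-product self-attention.

    A stand-in (assumption) for a learned Transformer: each output vector
    is a softmax-weighted average of all input vectors in the sequence.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum((w / total) * v[j] for w, v in zip(weights, seq))
            for j in range(d)
        ])
    return out


def tabbie_like_layer(cells):
    """One layer over a table of cell vectors, shape rows x cols x dim.

    The row pass contextualizes cells within each row, the column pass
    contextualizes cells within each column, and the new cell vector is
    the element-wise average of the two.
    """
    n_rows, n_cols = len(cells), len(cells[0])
    row_ctx = [self_attention(row) for row in cells]
    col_ctx = [
        self_attention([cells[r][c] for r in range(n_rows)])
        for c in range(n_cols)
    ]  # indexed as col_ctx[c][r]
    return [
        [
            [(a + b) / 2 for a, b in zip(row_ctx[r][c], col_ctx[c][r])]
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]


if __name__ == "__main__":
    table = [
        [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]],
    ]
    hidden = table
    for _ in range(2):  # stacking layers feeds each output into the next
        hidden = tabbie_like_layer(hidden)
    print(len(hidden), len(hidden[0]), len(hidden[0][0]))
```

Because the output keeps the same rows x cols x dim shape as the input, layers of this form can be stacked directly, which matches the caption's note that the averaged representation is passed as input to the next layer.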