Structural Deep Encoding for Table Question Answering

Raphaël Mouravieff, Benjamin Piwowarski, Sylvain Lamprier

TL;DR

This work addresses the challenge of processing tabular data with Transformers by preserving structural information through sparse attention and absolute encodings. It systematically evaluates existing table encoding methods and introduces novel sparse attention masks and structural modules to improve generalization and scalability. The key contributions include an ANOVA-based analysis of encoding factors, the M1 and M3 sparse masks, and empirical validation on synthetic data and real datasets like WikiSQL and WTQ. The findings demonstrate that sparse attention, combined with absolute positional cues, yields better generalization and substantial computational speedups for large tables, with practical implications for scalable table QA and related tasks.

Abstract

Although Transformer-based architectures excel at processing textual information, their naive adaptation to tabular data often involves flattening the table structure. This simplification can lead to the loss of essential inter-dependencies between rows, columns, and cells, while also posing scalability challenges for large tables. To address these issues, prior works have explored special tokens, structured embeddings, and sparse attention patterns. In this paper, we conduct a comprehensive analysis of tabular encoding techniques, which highlights the crucial role of attention sparsity in preserving the structural information of tables. We also introduce a set of novel sparse attention mask designs for tabular data that not only enhance computational efficiency but also preserve structural integrity, leading to better overall performance.
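To make the idea of a structure-preserving sparse attention pattern concrete, here is a minimal sketch of one generic row/column mask over a flattened table. It is an illustration only and is not the paper's M1 or M3 mask, whose definitions are not given in this section. The function name `row_column_mask` and the convention of marking question tokens with `-1` are assumptions made for this example.

```python
import numpy as np

def row_column_mask(row_ids, col_ids):
    """Boolean attention mask: entry [i, j] is True if token i may attend to token j.

    row_ids / col_ids hold one integer per token; -1 marks a question token
    (a hypothetical convention used only in this sketch).
    """
    rows = np.asarray(row_ids)
    cols = np.asarray(col_ids)
    is_question = rows < 0
    # Cell tokens see other tokens that share their row or their column.
    same_row = rows[:, None] == rows[None, :]
    same_col = cols[:, None] == cols[None, :]
    # Question tokens attend everywhere and are visible to every token.
    touches_question = is_question[:, None] | is_question[None, :]
    return same_row | same_col | touches_question

# Two question tokens followed by a flattened 2x2 table (row-major order).
row_ids = [-1, -1, 0, 0, 1, 1]
col_ids = [-1, -1, 0, 1, 0, 1]
mask = row_column_mask(row_ids, col_ids)
density = mask.mean()  # fraction of token pairs that are allowed to attend
```

In this toy case only the diagonal cell pairs, such as cell (0, 0) and cell (1, 1), are masked out. On larger tables the share of allowed pairs shrinks, which is where sparse attention kernels can save computation.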

Paper Structure

This paper contains 37 sections, 1 equation, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Overview of the encoding pipeline (blue) along with the different steps where table-specific information can be injected (red)
  • Figure 2: Denotation Accuracy differences between two structural encoding components (left $-$ right) while keeping all other factors unchanged.
  • Figure 3: Cumulative Distribution of Sequence Lengths and Relative Computation Speedup: The primary y-axis (left) represents the cumulative distribution of sequence lengths in log scale, while the secondary y-axis (right) shows the relative computation speedup for FlexAttention and FlashAttention2 across the sequence lengths shown on the x-axis.
  • Figure 4: Results for TAPAS and TAPAS+M1 under varying Mixability levels (S). All models have been trained on data with S=1, where the transition matrix for table creation is fully deterministic, and tested on increasingly challenging similarity levels, down to S=0, where the transition matrix is uniformly random. For this experiment, we exclusively used the “SELECT cx WHERE cy = vy” template SQL query.
  • Figure 5: This figure highlights the differences between two structural encoding components while keeping all other factors unchanged.
  • ...and 4 more figures