The Illusion of Generalization: Re-examining Tabular Language Model Evaluation

Aditya Gorla, Ratish Puduppully

TL;DR

This work challenges the claimed generalization of Tabular Language Models (TLMs) by re-evaluating Tabula-8B on 165 UniPredict datasets. It reveals that strong performance is largely driven by (i) task-type bias, with quartile classification inflating aggregate metrics, (ii) pervasive data contamination including train-test overlap and task leakage, and (iii) instruction-tuning without tabular exposure, which accounts for most of the gains. By disentangling these factors through baseline analyses, contamination checks, and comparisons to instruction-tuned baselines (e.g., Alpaca) and quartile-format augmented models (Alpaca+Q), the paper argues that reported generalization reflects evaluation artifacts rather than genuine tabular reasoning. It provides seven concrete recommendations to improve evaluation standards, such as reporting baselines, stratifying by task type, releasing evaluation code, and auditing the need for tabular reasoning. The findings call for more rigorous, contamination-aware, and task-type-aware benchmarking to ensure trustworthy progress in TLM research and deployment.

Abstract

Tabular Language Models (TLMs) have been claimed to achieve emergent generalization for tabular prediction. We conduct a systematic re-evaluation of Tabula-8B as a representative TLM, utilizing 165 datasets from the UniPredict benchmark. Our investigation reveals three findings. First, binary and categorical classification achieve near-zero median lift over majority-class baselines, and strong aggregate performance is driven entirely by quartile classification tasks. Second, top-performing datasets exhibit pervasive contamination, including complete train-test overlap and task-level leakage that evades standard deduplication. Third, instruction-tuning without tabular exposure recovers 92.2% of standard classification performance; on quartile classification, format familiarity closes 71.3% of the gap, with the residual attributable to contaminated datasets. These findings suggest that claimed generalization likely reflects evaluation artifacts rather than learned tabular reasoning. We conclude with recommendations for strengthening TLM evaluation.
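To make the first finding concrete, the following is a minimal sketch of how lift over a majority-class baseline could be computed per dataset and then summarized by task type. It assumes lift is the difference between model accuracy and majority-class accuracy; the paper's exact definition may differ. The function names and the toy data are hypothetical and are not taken from the paper or its evaluation code.

```python
from collections import Counter, defaultdict
from statistics import median


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    return sum(y == p for y, p in zip(labels, predictions)) / len(labels)


def lift_over_majority(labels, predictions):
    """Assumed definition: model accuracy minus majority-class accuracy."""
    return accuracy(labels, predictions) - majority_baseline_accuracy(labels)


def median_lift_by_task_type(results):
    """Summarize lift per task type.

    results: iterable of (task_type, labels, predictions), one per dataset.
    """
    lifts = defaultdict(list)
    for task_type, labels, predictions in results:
        lifts[task_type].append(lift_over_majority(labels, predictions))
    return {task: median(values) for task, values in lifts.items()}


# Toy, made-up data purely for illustration (not results from the paper).
toy_results = [
    # Model always predicts the majority class: 75% accuracy, zero lift.
    ("binary", [1, 1, 1, 0], [1, 1, 1, 1]),
    # Balanced quartile labels: the baseline is 25%, so 75% accuracy is +0.5.
    ("quartile", [0, 1, 2, 3], [0, 1, 2, 0]),
]
print(median_lift_by_task_type(toy_results))
# {'binary': 0.0, 'quartile': 0.5}
```

Both toy datasets have the same raw accuracy (75%), yet their lifts differ because quartile labels are balanced by construction and so have a low majority-class baseline. This is why an aggregate over mixed task types can look strong even when binary and categorical tasks show no lift.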

Paper Structure (69 sections, 7 figures, 13 tables)

Figures (7)

  • Figure 1: Performance decomposition by task type compared to aggregated metrics. Raw accuracy in red and lift over majority-class baseline in blue. Dotted line indicates 0 performance level for respective metrics.
  • Figure 2: Distribution of accuracy lift over the majority-class baseline by task type. The dotted line indicates zero accuracy lift.
  • Figure 3: Examples of data contamination in the T4 and UniPredict datasets. (a) Complete Overlap: A test row from us-womens-labor perfectly matches a T4 record, exposing the label. (b) Task Leakage: peloton-data passes row-level deduplication, but its date-to-day mapping is encoded in 844 unrelated T4 records (a representative match is shown), making the task solvable via memorization.
  • Figure 4: Lift over majority-class baseline by task type for Tabula-8B, Base Llama, Alpaca, and Alpaca+Q. Alpaca+Q was instruction-tuned for quartile classification only and is not applicable to other tasks. The dotted line indicates the zero performance level relative to the majority-class baseline.
  • Figure 5: Cohen's $\kappa$ distribution by task type. Horizontal lines indicate standard interpretation thresholds.
  • ...and 2 more figures