The Illusion of Generalization: Re-examining Tabular Language Model Evaluation
Aditya Gorla, Ratish Puduppully
TL;DR
This work challenges the claimed generalization of Tabular Language Models (TLMs) by re-evaluating Tabula-8B on 165 UniPredict datasets. It finds that strong performance is largely driven by three factors: (i) task-type bias, with quartile classification inflating aggregate metrics; (ii) pervasive data contamination, including train-test overlap and task leakage; and (iii) instruction-tuning without tabular exposure, which accounts for most of the gains. By disentangling these factors through baseline analyses, contamination checks, and comparisons to instruction-tuned baselines (e.g., Alpaca) and quartile-format-augmented models (Alpaca+Q), the paper argues that reported generalization reflects evaluation artifacts rather than genuine tabular reasoning. It offers seven concrete recommendations for improving evaluation standards, such as reporting baselines, stratifying by task type, releasing evaluation code, and auditing whether tasks actually require tabular reasoning. The findings call for more rigorous, contamination-aware, and task-type-aware benchmarking to ensure trustworthy progress in TLM research and deployment.
Abstract
Tabular Language Models (TLMs) have been claimed to achieve emergent generalization for tabular prediction. We conduct a systematic re-evaluation of Tabula-8B as a representative TLM, using 165 datasets from the UniPredict benchmark. Our investigation yields three findings. First, binary and categorical classification achieve near-zero median lift over majority-class baselines, and strong aggregate performance is driven entirely by quartile classification tasks. Second, top-performing datasets exhibit pervasive contamination, including complete train-test overlap and task-level leakage that evades standard deduplication. Third, instruction-tuning without tabular exposure recovers 92.2% of standard classification performance; on quartile classification, format familiarity closes 71.3% of the gap, with the residual attributable to contaminated datasets. These findings suggest that claimed generalization likely reflects evaluation artifacts rather than learned tabular reasoning. We conclude with recommendations for strengthening TLM evaluation.
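To make the first finding concrete, the following is a minimal sketch of how lift over a majority-class baseline could be computed per dataset and then summarized by task type. It is not the authors' evaluation code. It assumes that "lift" means the absolute difference between model accuracy and majority-class accuracy, and the function names, the `toy_results` data, and the task-type labels are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from statistics import median


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def accuracy(labels, preds):
    """Fraction of predictions that match the labels."""
    return sum(y == p for y, p in zip(labels, preds)) / len(labels)


def lift(labels, preds):
    """Assumed definition: model accuracy minus majority-class accuracy."""
    return accuracy(labels, preds) - majority_baseline_accuracy(labels)


def median_lift_by_task_type(results):
    """Group per-dataset lifts by task type and take the median of each group.

    `results` is an iterable of (task_type, labels, preds) tuples,
    one tuple per dataset.
    """
    lifts = defaultdict(list)
    for task_type, labels, preds in results:
        lifts[task_type].append(lift(labels, preds))
    return {task_type: median(values) for task_type, values in lifts.items()}


# Hypothetical toy data, not taken from the paper.
toy_results = [
    ("binary", [0, 0, 0, 1], [0, 0, 0, 0]),    # accuracy 0.75, baseline 0.75
    ("binary", [1, 1, 0, 0], [1, 1, 0, 1]),    # accuracy 0.75, baseline 0.50
    ("binary", [0, 0, 0, 1], [0, 0, 1, 1]),    # accuracy 0.75, baseline 0.75
    ("quartile", [0, 1, 2, 3], [0, 1, 2, 0]),  # accuracy 0.75, baseline 0.25
]

summary = median_lift_by_task_type(toy_results)
print(summary)
```

In this toy example every dataset has the same raw accuracy of 0.75, yet the median lift is 0.0 for the binary datasets and 0.5 for the quartile dataset. Because balanced quartile labels put the majority-class baseline near 0.25, reporting results stratified by task type alongside the baseline exposes differences that a single aggregate accuracy would hide.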
