Comparing Task-Agnostic Embedding Models for Tabular Data
Frederik Hoppe, Lars Kleinemeier, Astrid Franz, Udo Göbel
TL;DR
The paper investigates whether task-agnostic embeddings can emerge for tabular data by evaluating per-row representations from three sources: TableVectorizer (classical feature engineering) and the tabular foundation models TabPFN and TabICL. The embeddings are evaluated on ADBench (outlier detection) and TabArena Lite (supervised tasks). TableVectorizer often matches or outperforms the more complex models in task-agnostic settings, at far lower dimensionality (embeddings of up to $166$ dimensions) and computational cost, while TabICL can excel on some supervised tasks and TabPFN lags in efficiency. These results suggest that simple, transferable tabular embeddings are feasible and competitive, but broader validation, more models, and semantic or relational extensions are needed to realize robust, universal, task-agnostic representations. The work highlights practical trade-offs between embedding size, speed, and downstream performance, and outlines directions for future research, including semantic enrichment, context management, and cross-dataset or cross-modal alignment.
Abstract
Recent foundation models for tabular data achieve strong task-specific performance via in-context learning. Nevertheless, they focus on direct prediction by encapsulating both representation learning and task-specific inference inside a single, resource-intensive network. This work focuses specifically on representation learning, i.e., on transferable, task-agnostic embeddings. We systematically evaluate task-agnostic representations from tabular foundation models (TabPFN and TabICL) alongside classical feature engineering (TableVectorizer) across a variety of application tasks, such as outlier detection (ADBench) and supervised learning (TabArena Lite). We find that simple TableVectorizer features achieve comparable or superior performance while being up to three orders of magnitude faster than tabular foundation models. The code is available at https://github.com/ContactSoftwareAI/TabEmbedBench.
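To make the idea of a task-agnostic, per-row embedding concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a toy table invented for illustration and uses two hypothetical helpers, embed_rows and knn_outlier_scores. The embedding step is a hand-written stand-in for classical feature engineering (standardized numeric columns plus one-hot categorical columns), not the actual TableVectorizer, and the mean k-nearest-neighbour distance is merely one simple detector that can consume such embeddings.

```python
import math


def embed_rows(rows, numeric_cols, categorical_cols):
    """Map each row (a dict) to a fixed-length vector without using any label."""
    # Standardize numeric columns with the column mean and standard deviation.
    stats = {}
    for col in numeric_cols:
        values = [row[col] for row in rows]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        stats[col] = (mean, math.sqrt(var) or 1.0)  # avoid division by zero
    # One-hot encode categorical columns over the sorted observed values.
    categories = {col: sorted({row[col] for row in rows}) for col in categorical_cols}
    embeddings = []
    for row in rows:
        vec = [(row[col] - stats[col][0]) / stats[col][1] for col in numeric_cols]
        for col in categorical_cols:
            vec.extend(1.0 if row[col] == cat else 0.0 for cat in categories[col])
        embeddings.append(vec)
    return embeddings


def knn_outlier_scores(embeddings, k=2):
    """Score each row by its mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, a in enumerate(embeddings):
        dists = sorted(math.dist(a, b) for j, b in enumerate(embeddings) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Toy table (assumed data): the last row has an implausible income.
rows = [
    {"age": 31, "income": 42.0, "city": "Bremen"},
    {"age": 29, "income": 40.5, "city": "Bremen"},
    {"age": 33, "income": 44.0, "city": "Hamburg"},
    {"age": 30, "income": 41.0, "city": "Hamburg"},
    {"age": 32, "income": 400.0, "city": "Bremen"},
]

# The same embeddings could feed any downstream task; here, outlier detection.
embeddings = embed_rows(rows, ["age", "income"], ["city"])
scores = knn_outlier_scores(embeddings, k=2)
most_anomalous = max(range(len(rows)), key=scores.__getitem__)
print(most_anomalous, [round(s, 2) for s in scores])
```

The embedding is computed once, without reference to any target, and the downstream method only sees the resulting vectors. This separation of representation from task-specific inference is what the abstract contrasts with foundation models that bundle both into a single network.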
