Position: Foundation Models for Tabular Data within Systemic Contexts Need Grounding
Tassilo Klein, Johannes Hoffart
TL;DR
The paper argues that foundation models for tabular data face inherent limitations unless they are grounded in the operational contexts that produce and govern the data. It introduces Semantically Linked Tables (SLT) and Foundation Models for SLT (FMSLT), and advocates dual-phase training: pre-training on open-source code-data pairs and synthetic systems, followed by zero-shot inference on private data. It proposes an Operational Turing Test to benchmark operational grounding, complemented by privacy-preserving sandboxes and synthetic data strategies to overcome data silos. The authors outline a Retrieval-Augmented In-Context Learning architecture and argue for SLT-native architectures that integrate declarative rules, procedural code, and world knowledge. If realized, FMSLTs would let autonomous agents reason over complex workflows and produce context-aware predictions and explanations for decision-making in enterprise data ecosystems.
Abstract
This position paper argues that foundation models for tabular data face inherent limitations when isolated from operational context: the procedural logic, declarative rules, and domain knowledge that define how data is created and governed. Current approaches focus on single-table generalization or schema-level relationships and therefore miss the operational knowledge that gives data meaning. We introduce Semantically Linked Tables (SLT) and Foundation Models for SLT (FMSLT) as a new model class that grounds tabular data in its operational context. We propose dual-phase training: pre-training on open-source code-data pairs and synthetic systems to learn the mechanics of business logic, followed by zero-shot inference on proprietary data. We introduce the "Operational Turing Test" benchmark and argue that operational grounding is essential for autonomous agents in complex data environments.
