Position: Foundation Models for Tabular Data within Systemic Contexts Need Grounding

Tassilo Klein, Johannes Hoffart

TL;DR

The paper argues that foundation models for tabular data fail to generalize without grounding in the operational contexts that produce and govern data. It introduces Semantically Linked Tables (SLT) and Foundation Models for SLT (FMSLT), advocating dual-phase training: pre-training on open-source code and synthetic systems, followed by zero-shot inference on private data. An Operational Turing Test is proposed to benchmark true operational grounding, complemented by privacy-preserving sandboxes and synthetic data strategies to overcome data silos. The authors outline a Retrieval-Augmented In-Context Learning architecture and argue for SLT-native architectures that integrate declarative rules, procedural code, and world knowledge. If realized, FMSLTs would enable autonomous agents to reason over complex workflows with robust, context-aware predictions and explanations, transforming grounded decision-making in enterprise data ecosystems.

Abstract

This position paper argues that foundation models for tabular data face inherent limitations when isolated from operational context: the procedural logic, declarative rules, and domain knowledge that define how data is created and governed. Current approaches focus on single-table generalization or schema-level relationships, fundamentally missing the operational knowledge that gives data meaning. We introduce Semantically Linked Tables (SLT) and Foundation Models for SLT (FMSLT) as a new model class that grounds tabular data in its operational context. We propose dual-phase training: pre-training on open-source code-data pairs and synthetic systems to learn business logic mechanics, followed by zero-shot inference on proprietary data. We introduce the "Operational Turing Test" benchmark and argue that operational grounding is essential for autonomous agents in complex data environments.
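
To make "operational context" concrete, the following is a minimal illustrative sketch, not taken from the paper. It assumes a toy webshop with an order-line table that references an inventory table by `product_id`, plus one validation rule that defines which rows are valid data states. All names (`OrderLine`, `inventory`, `validate_order_line`) and the rules themselves are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderLine:
    """One row of a hypothetical order-line table."""
    order_id: int
    product_id: int  # foreign key into the inventory table
    quantity: int


# Hypothetical inventory table: product_id -> units in stock.
inventory = {101: 5, 102: 0}


def validate_order_line(line, inventory):
    """Return the list of violated rules; an empty list means the row is valid."""
    violations = []
    if line.product_id not in inventory:
        # Schema-level link: the foreign key must resolve.
        violations.append("unknown product_id (broken foreign key)")
    elif line.quantity > inventory[line.product_id]:
        # Operational rule: an order cannot exceed available stock.
        violations.append("quantity exceeds stock")
    if line.quantity <= 0:
        # Operational rule: ordered quantities must be positive.
        violations.append("quantity must be positive")
    return violations
```

In this sketch the foreign key alone only says that the two tables are related. The stock and quantity checks are the kind of procedural logic that a schema does not capture, and that the paper argues a model should be grounded in.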

Paper Structure

This paper contains 12 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: The composition of SLT.
  • Figure 2: SLT Example: A manufacturing supply chain with webshop integration. Tables are linked through foreign keys and governed by operational logic (configuration rules, inventory validation) that defines valid data states.
  • Figure 3: Mockup schema.
  • Figure 4: Retrieval-Augmented In-Context Learning Architecture. Training learns reasoning from synthetic system-table pairs; inference retrieves operational artifacts and applies learned ICL for zero-shot prediction on private tables.
  • Figure 5: Conceptual positioning of FMSLT relative to existing paradigms, categorized by data richness (y-axis) and model capabilities (x-axis).

Theorems & Definitions (2)

  • Definition 2.1: Operational Knowledge
  • Definition 6.1: The Operational Turing Test