Mind the Data Gap: Bridging LLMs to Enterprise Data Integration

Moe Kayali, Fabian Wenz, Nesime Tatbul, Çağatay Demiralp

TL;DR

LLMs trained on public data underperform on real enterprise data in data integration tasks, because most enterprise data is dark data whose distribution differs from that of public data. The authors introduce Goby, an enterprise benchmark for semantic column-type annotation, and develop a hierarchical, ontology-based framework (iterative dictionary construction, ontology synthesis, and full-context encoding) to uplift LLM performance. On Goby, raw LLM performance drops relative to public benchmarks, but tree-serialization-based encoding with full ontology context can reach $F_1$ scores around 0.85, approaching public-data levels. The work provides a realistic evaluation platform and practical methods for bridging the enterprise data gap, and encourages broader adoption of Goby for benchmarking and method development.

Abstract

Leading large language models (LLMs) are trained on public data. However, most of the world's data is dark data that is not publicly accessible, mainly in the form of private organizational or enterprise data. We show that the performance of methods based on LLMs seriously degrades when tested on real-world enterprise datasets. Current benchmarks, based on public data, overestimate the performance of LLMs. We release a new benchmark dataset, the GOBY Benchmark, to advance discovery in enterprise data integration. Based on our experience with this enterprise benchmark, we propose techniques to uplift the performance of LLMs on enterprise data, including (1) hierarchical annotation, (2) runtime class-learning, and (3) ontology synthesis. We show that, once these techniques are deployed, the performance on enterprise data becomes on par with that of public data. The Goby benchmark can be obtained at https://goby-benchmark.github.io/.

Paper Structure (20 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Semantic column label distribution in the Goby dataset.
  • Figure 2: Mechanism of grammar-constrained decoding (GCD). GCD is applied to ensure the LLM complies with the ontology when finding the correct hierarchical classes for a given data instance. Possible classes are constrained to the given dictionaries, and relations are based on the previous layer. During decoding, only valid token continuations compliant with the grammar (shown in green) are considered. Tokens outside the grammar (shown in red) are rejected.
  • Figure 3: Tree-serialization approach. This figure shows the model prompt when serializing the ontology tree and presenting it to the model as one structure.
  • Figure 4: Step-by-step prompting. The model navigates a tree of potential prompts by making completions at each node. This figure shows the sequence of prompts used in this mode of operation, allowing the LLM to reason step-by-step. The prompt structure and table serialization are in line with those shown in Figure 3.
  • Figure 5: Example of the synthesized ontology. The base classes, in black, are first learned iteratively. The super-classes, in orange, and their relations are built in the ontology-learning step.
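
The constraint mechanism described in the Figure 2 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy ontology, `constrained_decode`, and `mock_scores` are hypothetical names introduced here for illustration. As a simplification, whole class names stand in for tokens and a fixed lookup table stands in for LLM logits.

```python
from typing import Callable, Dict, List

# Hypothetical toy ontology (an assumption, not taken from the paper):
# each class maps to the sub-classes allowed on the next layer.
ONTOLOGY: Dict[str, List[str]] = {
    "ROOT": ["Location", "Person"],
    "Location": ["City", "Country"],
    "Person": ["Customer", "Employee"],
    "City": [],
    "Country": [],
    "Customer": [],
    "Employee": [],
}


def constrained_decode(
    score: Callable[[List[str], str], float],
    ontology: Dict[str, List[str]] = ONTOLOGY,
) -> List[str]:
    """Greedily decode a root-to-leaf class path.

    At each layer, only the children of the previously chosen class are
    considered; every other token is rejected regardless of its score.
    """
    path: List[str] = []
    current = "ROOT"
    while ontology[current]:
        allowed = ontology[current]  # valid continuations under the grammar
        current = max(allowed, key=lambda token: score(path, token))
        path.append(current)
    return path


def mock_scores(path: List[str], token: str) -> float:
    """Stand-in for model logits. "Country" scores highest everywhere."""
    table = {
        "Country": 5.0,
        "Person": 3.0,
        "Employee": 2.0,
        "Customer": 1.0,
        "Location": 0.5,
        "City": 0.1,
    }
    return table.get(token, 0.0)


if __name__ == "__main__":
    print(constrained_decode(mock_scores))
```

In this sketch the mock scorer always prefers `Country`, yet the decoder never emits it: `Country` is not a child of `ROOT` at the first layer, and it is not a child of `Person` at the second. A real system would apply the same masking at the token level inside the decoding loop.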