OpenForge: Probabilistic Metadata Integration

Tianji Cong, Fatemeh Nargesian, Junjie Xing, H. V. Jagadish

TL;DR

OpenForge addresses metadata integration by modeling relationships among metadata concepts as a MAP inference problem on a Markov Random Field (MRF), enforcing transitivity constraints via a shared ternary potential. It uses a two-stage approach: the first stage aggregates prior beliefs from prompted or fine-tuned LLMs and traditional ML models; the second refines those predictions with MAP inference on an MRF conditioned on observed evidence $\mathcal{E}$ and axioms $\mathcal{A}$, yielding the optimal relationship graph $\tilde{G}$ that maximizes $P(G \mid \mathcal{E},\mathcal{A})$. On three real-world datasets (SOTAB and Walmart-Amazon for equivalence; ICPSR for taxonomy), OpenForge consistently outperforms baselines including GPT-4, by up to 25 F1 points, and remains scalable through GPU-accelerated inference and local MRF decomposition. These results demonstrate a practical, scalable method to unify and maintain heterogeneous metadata vocabularies, improving findability and interoperability of data assets.

Abstract

Modern data stores increasingly rely on metadata to enable diverse activities such as data cataloging and search. However, metadata curation remains a labor-intensive task, and the broader challenge of metadata maintenance -- ensuring its consistency, usefulness, and freshness -- has been largely overlooked. In this work, we tackle the problem of resolving relationships among metadata concepts from disparate sources. These relationships are critical for creating clean, consistent, and up-to-date metadata repositories, and resolving them is a central challenge for metadata integration. We propose OpenForge, a two-stage prior-posterior framework for metadata integration. In the first stage, OpenForge exploits multiple methods, including fine-tuned large language models, to obtain prior beliefs about concept relationships. In the second stage, OpenForge refines these predictions using a Markov Random Field (MRF), a probabilistic graphical model. We formalize metadata integration as an optimization problem in which the objective is to identify the relationship assignments that maximize their joint probability. The MRF formulation allows OpenForge to capture prior beliefs while encoding critical relationship properties, such as transitivity, in probabilistic inference. Experiments on real-world datasets demonstrate the effectiveness and efficiency of OpenForge. On a use case of matching two metadata vocabularies, OpenForge outperforms GPT-4, the second-best method, by 25 F1-score points.
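To make the prior-posterior idea concrete, the following is a minimal sketch of MAP inference over three binary relationship variables with a transitivity factor. It is not the authors' implementation: the prior values, the `penalty` constant, the form of the ternary potential, and all function names are assumptions chosen for illustration, and the exhaustive enumeration only works for toy sizes.

```python
from itertools import product

# Hypothetical stage-one priors: probability that each concept pair is equivalent.
priors = {("a", "b"): 0.9, ("b", "c"): 0.8, ("a", "c"): 0.4}


def unary(prior, x):
    """Unary factor: the prior belief for label x (1 = related, 0 = not)."""
    return prior if x == 1 else 1.0 - prior


def ternary(x_ab, x_bc, x_ac, penalty=1e-6):
    """Assumed transitivity factor for an equivalence relation.

    If exactly two of the three pairs are related, the third must be too,
    so such assignments receive a near-zero potential.
    """
    return penalty if (x_ab + x_bc + x_ac) == 2 else 1.0


def map_assignment(priors):
    """Brute-force MAP: score all 2^3 joint assignments and keep the best."""
    pairs = list(priors)
    best, best_score = None, -1.0
    for labels in product((0, 1), repeat=len(pairs)):
        score = ternary(*labels)
        for pair, x in zip(pairs, labels):
            score *= unary(priors[pair], x)
        if score > best_score:
            best, best_score = labels, score
    return dict(zip(pairs, best)), best_score


# Thresholding each prior independently ignores transitivity.
thresholded = {pair: int(p >= 0.5) for pair, p in priors.items()}
assignment, score = map_assignment(priors)
print("independent thresholding:", thresholded)
print("MAP with transitivity:  ", assignment)
```

With these illustrative numbers, thresholding labels (a, b) and (b, c) as related but (a, c) as unrelated, which violates transitivity. The joint MAP assignment instead flips (a, c) to related, because the strong evidence on the other two pairs outweighs its weak prior.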

Paper Structure

This paper contains 37 sections, 5 equations, 9 figures, 3 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustration of metadata integration problem.
  • Figure 2: Illustration of the relationship transitivity (left) and inconsistent relationship assignments that violate the transitivity (right). Green edges indicate correct predictions and red edges indicate conflicting predictions.
  • Figure 3: Overview of the proposed two-stage prior-posterior framework for integrating metadata concepts.
  • Figure 4: The plot on the left demonstrates an instance of our proposed MRF model containing six nodes/random variables and their dependencies; the plot on the right, known as a factor graph, visualizes the correspondence between factors and random variables in the MRF.
  • Figure 5: Creating independent MRFs for concept pairs in large datasets with sparse relationships. Ordered pairs are first grouped by the left concept and top-$k$ neighbors (represented by the purple semi-circles) are retrieved for the left concept to construct a local MRF of random variables. Inference over independent MRFs is parallelized on available CPUs and posterior predictions of test pairs are collected.
  • ...and 4 more figures
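The local decomposition described in the Figure 5 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's procedure: the word-overlap `jaccard` similarity, the choice of one random variable per unordered concept pair, and every function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Hypothetical similarity: word-level Jaccard overlap of two concept names."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)


def group_by_left(pairs):
    """Group ordered test pairs by their left concept."""
    groups = defaultdict(list)
    for left, right in pairs:
        groups[left].append(right)
    return dict(groups)


def top_k_neighbors(concept, vocabulary, similarity, k):
    """Return the k concepts most similar to `concept` (excluding itself)."""
    candidates = [c for c in vocabulary if c != concept]
    return sorted(candidates, key=lambda c: similarity(concept, c), reverse=True)[:k]


def build_local_mrfs(test_pairs, vocabulary, similarity, k):
    """Map each left concept to the variables (concept pairs) of its local MRF."""
    local_mrfs = {}
    for left, rights in group_by_left(test_pairs).items():
        neighbors = top_k_neighbors(left, vocabulary, similarity, k)
        concepts = [left] + sorted(set(rights) | set(neighbors))
        # Assumption: one random variable per unordered pair of local concepts.
        local_mrfs[left] = list(combinations(concepts, 2))
    return local_mrfs


vocabulary = ["zip code", "postal code", "city", "city name"]
test_pairs = [("zip code", "postal code"), ("city", "city name")]
local_mrfs = build_local_mrfs(test_pairs, vocabulary, jaccard, k=2)
for left, variables in local_mrfs.items():
    print(left, "->", variables)
```

Because each local MRF depends only on its own concepts, inference over them could be dispatched independently, for example to a process pool, and the posterior labels of the test pairs collected afterwards.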

Theorems & Definitions (2)

  • Definition 1: Probability of a Relationship Assignment Graph
  • Definition 2: Optimal Relationship Graph
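
The two definitions are listed here by title only. Based on the notation in the TL;DR, the optimal relationship graph can be written as the following objective; the factorized form after it is an assumed, generic MRF decomposition into unary prior factors $\phi$ and a shared ternary transitivity factor $\psi$ over a set of concept triples $\mathcal{T}$, and is not quoted from the paper:

$$\tilde{G} = \arg\max_{G} P(G \mid \mathcal{E}, \mathcal{A})$$

$$P(G \mid \mathcal{E}, \mathcal{A}) \propto \prod_{(i,j)} \phi_{ij}(x_{ij}) \prod_{(i,j,k) \in \mathcal{T}} \psi(x_{ij}, x_{jk}, x_{ik})$$

Here $x_{ij}$ denotes the relationship label assigned to the concept pair $(i, j)$ in $G$; the symbols $\phi$, $\psi$, $\mathcal{T}$, and $x_{ij}$ are introduced for illustration only.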