GRAM: Generative Retrieval Augmented Matching of Data Schemas in the Context of Data Security

Xuanqing Liu, Luyang Kong, Runhui Wang, Patrick Song, Austin Nevins, Henrik Johnson, Nimish Amlathe, Davor Golac

TL;DR

GRAM tackles privacy-preserving schema matching by leveraging instruction-finetuned LLMs with retrieval augmentation. It introduces NER-based destination filtering, Double-RAG prompting, plus object type and key detectors to deliver an end-to-end data-table ingestion service. Experiments show GRAM outperforms traditional baselines (e.g., 88.7% average accuracy) and that prompt compression and filtering can improve throughput, with a favorable shift toward zero-/few-shot inference. The work demonstrates practical applicability for secure data integration and highlights areas for robustness and scalability in real deployments.

Abstract

Schema matching constitutes a pivotal phase in the data ingestion process for contemporary database systems. Its objective is to discern pairwise similarities between two sets of attributes, each associated with a distinct data table. This challenge emerges at the initial stages of data analytics, such as when incorporating a third-party table into existing databases to inform business insights. Given its significance in the realm of database systems, schema matching has been under investigation since the 2000s. This study revisits this foundational problem within the context of large language models. Adhering to increasingly stringent data security policies, our focus lies on the zero-shot and few-shot scenarios: the model should analyze only a minimal amount of customer data to execute the matching task, contrasting with the conventional approach of scrutinizing the entire data table. We emphasize that the zero-shot or few-shot assumption is imperative to safeguard the identity and privacy of customer data, even at the potential cost of accuracy. The capability to accurately match attributes under such stringent requirements distinguishes our work from previous literature in this domain.

Paper Structure

This paper contains 36 sections, 6 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Illustration of hierarchical prediction in schema mapping. First, the columns of the input data table are partitioned into one or more object types, here Profile and Order (the two ellipses in the figure). Next, each column of a partition group traverses the $n$-ary tree according to the classification result at each level until it reaches a leaf node (red arrows). Each leaf node corresponds to a target attribute defined by the target schema. The process repeats until every column is mapped to a target attribute.
  • Figure 2: An example of what an individual attribute in the schema looks like. The required field (column name) is shaded; all other fields (data type, nullable, column meaning, values, length, etc.) are optional.
  • Figure 3: An illustrative example of prompting a large language model (LLM) to match a source attribute (e.g., contact_name for Amazon.com Inc.) against a list of $15$ target attributes.
  • Figure 4: Architecture and workflow of GRAM.
  • Figure 5: Inference speedup due to the NER and double-RAG filters. With the double-RAG filter, we keep $k_{\mathrm{opt}}=4$ options and $k_{\mathrm{ex}}=1$ example for each of the $4$ options. Error bars are shown but are barely visible.
  • ...and 3 more figures
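
The hierarchical prediction described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Node` class, the `SCHEMA` layout, and the function names are hypothetical, and the per-level classifier (an LLM in GRAM) is replaced here by a plain string-similarity score from Python's `difflib`. Each column is first assigned to an object type, then descends that type's tree until it reaches a terminal node, which stands for a target attribute.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:
    """A node in a (hypothetical) target-schema tree; terminal nodes are target attributes."""
    name: str
    children: list["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def leaf_names(node: Node) -> list[str]:
    """Collect the names of all target attributes at or below a node."""
    if node.is_leaf():
        return [node.name]
    return [name for child in node.children for name in leaf_names(child)]


def score(column: str, node: Node) -> float:
    """Stand-in classifier (assumption): best string similarity between the
    column name and any target attribute under the node."""
    return max(
        SequenceMatcher(None, column.lower(), leaf.lower()).ratio()
        for leaf in leaf_names(node)
    )


def map_column(column: str, root: Node) -> str:
    """Descend from the top of the tree, choosing the best-scoring child at each level."""
    node = root
    while not node.is_leaf():
        node = max(node.children, key=lambda child: score(column, child))
    return node.name


def map_table(columns: list[str], schema: dict[str, Node]) -> dict[str, tuple[str, str]]:
    """Assign each column to an object type, then to a target attribute within it."""
    mapping = {}
    for column in columns:
        object_type = max(schema, key=lambda t: score(column, schema[t]))
        mapping[column] = (object_type, map_column(column, schema[object_type]))
    return mapping


# Hypothetical target schema with two object types, mirroring the caption's example.
SCHEMA = {
    "Profile": Node("Profile", [
        Node("name", [Node("first_name"), Node("last_name")]),
        Node("contact", [Node("email"), Node("phone")]),
    ]),
    "Order": Node("Order", [
        Node("order_id"), Node("order_date"), Node("total_amount"),
    ]),
}

print(map_table(["Email", "ORDER_DATE"], SCHEMA))
# {'Email': ('Profile', 'email'), 'ORDER_DATE': ('Order', 'order_date')}
```

Note that only column names are consulted, which matches the zero-shot setting in spirit; a real classifier would replace `score` and could also use the optional attribute fields when they are available.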