Gem: Gaussian Mixture Model Embeddings for Numerical Feature Distributions

Hafiz Tayyab Rauf; Alex Bogatu; Norman W. Paton; Andre Freitas

TL;DR

This paper proposes Gem (Gaussian mixture model embeddings), a method that builds embeddings from the value distributions of numerical columns, using a Gaussian Mixture Model to identify and cluster columns with similar distributions.

Abstract

Embeddings now underpin a wide variety of data management tasks, including entity resolution, dataset search and semantic type detection. Such applications often involve datasets with numerical columns, but embeddings have placed more emphasis on the semantics of categorical data than on the distinctive features of numerical data. In this paper, we propose a method called Gem (Gaussian mixture model embeddings) that creates embeddings built on the numerical value distributions of columns. The proposed method specializes a Gaussian Mixture Model (GMM) to identify and cluster columns with similar value distributions. We introduce a signature mechanism that generates a probability matrix for each column, indicating its likelihood of belonging to specific Gaussian components, which can be used for different applications, such as determining semantic types. Finally, we generate embeddings for three numerical data properties: distributional, statistical, and contextual. Our core method focuses solely on numerical columns, without using table names or neighboring columns for context. However, the method can be combined with other types of evidence, and we later integrate attribute names with the Gaussian embeddings to evaluate the method's contribution to improving overall performance. We compare Gem with several baseline methods on numeric-only and numeric + context tasks, showing that Gem consistently outperforms the baselines on four benchmark datasets.
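
The signature idea in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: it fits a one-dimensional Gaussian mixture with plain EM over the pooled values of a few synthetic columns, then summarises each column by the average responsibility its values receive from each component. The averaging step, the helper names (`fit_gmm_1d`, `responsibilities`, `column_signature`), the choice of two components, and the synthetic "age" and "price" columns are all assumptions made for illustration; the abstract only states that a per-column probability matrix over Gaussian components is produced.

```python
import math
import random


def responsibilities(x, weights, means, variances):
    """Posterior probability of each Gaussian component for one value x."""
    logs = [
        math.log(w + 1e-300) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # subtract the max log-density for numerical stability
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


def fit_gmm_1d(values, k, iters=50):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative only)."""
    xs = sorted(values)
    n = len(xs)
    # Initialise means at evenly spaced quantiles, with shared variance and equal weights.
    means = [xs[int((i + 0.5) * n / k)] for i in range(k)]
    mean_all = sum(xs) / n
    var_all = sum((x - mean_all) ** 2 for x in xs) / n or 1.0
    variances = [var_all] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of every component for every value.
        resp = [responsibilities(x, weights, means, variances) for x in xs]
        # M-step: re-estimate weight, mean and variance of each component.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj + 1e-6
    return weights, means, variances


def column_signature(column, gmm):
    """Average the per-value responsibilities into one vector per column (assumed aggregation)."""
    weights, means, variances = gmm
    totals = [0.0] * len(weights)
    for x in column:
        for j, r in enumerate(responsibilities(x, weights, means, variances)):
            totals[j] += r
    return [t / len(column) for t in totals]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical columns: two "age"-like columns and one "price"-like column.
rng = random.Random(0)
ages_a = [rng.gauss(35, 8) for _ in range(200)]
ages_b = [rng.gauss(36, 8) for _ in range(200)]
prices = [rng.gauss(500, 50) for _ in range(200)]

# Fit one mixture over the pooled values, then derive a signature per column.
gmm = fit_gmm_1d(ages_a + ages_b + prices, k=2)
sig_a = column_signature(ages_a, gmm)
sig_b = column_signature(ages_b, gmm)
sig_p = column_signature(prices, gmm)

sim_ab = cosine(sig_a, sig_b)  # columns with similar distributions
sim_ap = cosine(sig_a, sig_p)  # columns with different distributions
print("ages_a vs ages_b:", sim_ab)
print("ages_a vs prices:", sim_ap)
```

In this sketch, columns whose values fall under the same components end up with nearby signatures, which is the property that would let such vectors be clustered or compared for tasks like semantic type detection.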

Figures (5)