Selecting Walk Schemes for Database Embedding

Yuval Lev Lubarsky, Jan Tönshoff, Martin Grohe, Benny Kimelfeld

TL;DR

The paper studies the embedding of relational database tuples into a vector space, based on random walks that follow foreign keys, and introduces walk-scheme selection to speed up FoRWaRD on dynamic databases. It defines three families of scheme-selection approaches (FoRWaRD-Less, Light Training, and Online Elimination) and shows that training on a small, informative subset of targeted walk schemes produces embeddings considerably faster (the abstract reports, e.g., three times faster) with little to no loss in downstream accuracy, and with an improvement in some cases. The main finding is that kernel-variance-based selection often gives the best trade-off between speed and quality while preserving extensibility to newly inserted tuples. The authors suggest that the approach may carry over to other sequence-based database embeddings, and they point to data augmentation and cross-database alignment as directions for future work.

Abstract

Machinery for data analysis often requires a numeric representation of the input. Towards that, a common practice is to embed components of structured data into a high-dimensional vector space. We study the embedding of the tuples of a relational database, where existing techniques are often based on optimization tasks over a collection of random walks from the database. The focus of this paper is on the recent FoRWaRD algorithm that is designed for dynamic databases, where walks are sampled by following foreign keys between tuples. Importantly, different walks have different schemas, or "walk schemes", that are derived by listing the relations and attributes along the walk. Also importantly, different walk schemes describe relationships of different natures in the database. We show that by focusing on a few informative walk schemes, we can obtain tuple embedding significantly faster, while retaining the quality. We define the problem of scheme selection for tuple embedding, devise several approaches and strategies for scheme selection, and conduct a thorough empirical study of the performance over a collection of downstream tasks. Our results confirm that with effective strategies for scheme selection, we can obtain high-quality embeddings considerably (e.g., three times) faster, preserve the extensibility to newly inserted tuples, and even achieve an increase in the precision of some tasks.
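
To make the notion of a walk scheme more concrete, the following is a minimal sketch, not the paper's implementation. It enumerates walk schemes over a toy schema by following foreign keys in either direction, then pairs each scheme with an attribute of its last relation to form a targeted walk scheme. The schema, the attribute names, the helper names (`steps_from`, `walk_schemes`, `targeted_walk_schemes`), and the convention that length counts foreign-key steps are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical toy schema, loosely inspired by Mondial (an assumption,
# not the schema used in the paper).
ATTRIBUTES: Dict[str, List[str]] = {
    "Country": ["name", "population"],
    "City": ["name"],
    "Religion": ["name", "percentage"],
}

# Foreign keys as (referencing relation, its attribute,
#                  referenced relation, its attribute).
FOREIGN_KEYS: List[Tuple[str, str, str, str]] = [
    ("City", "country", "Country", "code"),
    ("Religion", "country", "Country", "code"),
]

# One step of a scheme: (from_relation, from_attr, to_relation, to_attr).
Step = Tuple[str, str, str, str]
Scheme = Tuple[Step, ...]


def steps_from(relation: str) -> List[Step]:
    """Steps leaving `relation`, following foreign keys in both directions."""
    out: List[Step] = []
    for src, src_attr, dst, dst_attr in FOREIGN_KEYS:
        if src == relation:
            out.append((src, src_attr, dst, dst_attr))  # forward
        if dst == relation:
            out.append((dst, dst_attr, src, src_attr))  # backward
    return out


def walk_schemes(start: str, max_len: int) -> List[Scheme]:
    """All schemes starting at `start` with 1..max_len foreign-key steps."""
    schemes: List[Scheme] = []
    frontier: List[Scheme] = [()]
    for _ in range(max_len):
        next_frontier: List[Scheme] = []
        for path in frontier:
            current = path[-1][2] if path else start
            for step in steps_from(current):
                next_frontier.append(path + (step,))
        schemes.extend(next_frontier)
        frontier = next_frontier
    return schemes


def targeted_walk_schemes(start: str, max_len: int) -> List[Tuple[Scheme, str]]:
    """Pairs (s, A): a scheme s and an attribute A of its last relation."""
    return [
        (s, attr)
        for s in walk_schemes(start, max_len)
        for attr in ATTRIBUTES[s[-1][2]]
    ]


if __name__ == "__main__":
    for scheme, attr in targeted_walk_schemes("Country", 2):
        path = " ; ".join(f"{a}[{b}]-{c}[{d}]" for a, b, c, d in scheme)
        print(f"({path}, {attr})")
```

Even in this tiny schema the number of targeted walk schemes grows with the maximum length, which is the motivation for selecting only a few informative ones before training.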

Paper Structure

This paper contains 17 sections, 9 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: FoRWaRD vs. FoRWaRD with scheme selection
  • Figure 2: Religion prediction over the Mondial dataset with FoRWaRD and scheme selection via kernel variance. With a fifth of the walk schemes, we reach full quality in about one-third of the embedding time and eventually even outperform the embedding with the entire set of walk schemes.
  • Figure 3: FoRWaRD vs. FoRWaRD with scheme selection
  • Figure 4: Example of a database, with foreign-key constraints, taken from the Mondial dataset.
  • Figure 5: Examples of targeted walk schemes of length one to four, for the database schema of Figure 4. All walk schemes start at the Country relation. The figure of the walk scheme for $(s,A)$ shows $s$ as a path of rectangles and $A$ (e.g., name) as an attribute under the rightmost (last) rectangle.
  • ...and 3 more figures