Xling: A Learned Filter Framework for Accelerating High-Dimensional Approximate Similarity Join

Yifan Wang, Vyom Pathak, Daisy Zhe Wang

TL;DR

Xling is a generic framework for building a learning-based metric space filter from any existing regression model. It aims to accurately predict whether a query point has enough neighbors, and it provides a suite of optimization strategies that further improve the prediction quality of the learned model.

Abstract

Similarity join finds all pairs of close points within a given distance threshold. Many similarity join methods have been proposed, but they are usually inefficient in high-dimensional spaces due to the curse of dimensionality and data-unawareness. We investigate the possibility of using the metric space Bloom filter (MSBF), a family of data structures that check whether a query point has neighbors in a multi-dimensional space, to speed up similarity join. However, applying MSBF to similarity join raises several challenges, including excessive information loss, data-unawareness, and hard constraints on the distance metric. In this paper, we propose Xling, a generic framework for building a learning-based metric space filter from any existing regression model, aiming to accurately predict whether a query point has enough neighbors. The framework provides a suite of optimization strategies that further improve the prediction quality of the learned model, and it demonstrates significantly higher prediction quality than existing MSBFs. We also propose XJoin, one of the first filter-based similarity join methods, built on Xling. By predicting and skipping queries that lack enough neighbors, XJoin effectively reduces unnecessary neighbor searching and thereby achieves a remarkable acceleration. Benefiting from the generalization capability of deep learning models, XJoin can be easily transferred to a new dataset (with a similar distribution) without re-training. Furthermore, Xling is not limited to XJoin; it acts as a flexible plugin that can be inserted into any loop-based similarity join method for a speedup.
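
The filter-then-search idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `range_search` is a brute-force stand-in for whatever neighbor search a join method uses, `predict_count` is a hypothetical callable playing the role of the learned regressor, `make_sample_estimator` is a simple sampling-based placeholder for a trained model, and `tau` is an assumed name for the "enough neighbors" cut-off.

```python
import numpy as np

def range_search(data, q, eps):
    """Brute-force range search: indices of points in data within eps of q."""
    dists = np.linalg.norm(data - q, axis=1)
    return np.flatnonzero(dists <= eps)

def filtered_join(R, S, eps, tau, predict_count):
    """Loop-based join of R against S that skips queries the filter rejects.

    predict_count(q, eps) is a hypothetical estimator of how many points of S
    lie within eps of q; queries with an estimate below tau are not searched.
    Returns the joined (i, j) index pairs and the number of searches performed.
    """
    pairs, searched = [], 0
    for i, q in enumerate(R):
        if predict_count(q, eps) < tau:
            continue  # predicted negative: skip the search (a wrong skip costs recall)
        searched += 1
        pairs.extend((i, int(j)) for j in range_search(S, q, eps))
    return pairs, searched

def make_sample_estimator(S, step=4):
    """Placeholder for a trained regressor: count neighbors in a subsample of S
    and scale the count back up. A learned model would replace this."""
    sample = S[::step]
    def predict_count(q, eps):
        return step * len(range_search(sample, q, eps))
    return predict_count

# Toy data: 200 tightly clustered points, 5 nearby queries, 5 far-away queries.
S = np.linspace(-0.1, 0.1, 200 * 8).reshape(200, 8)
R = np.vstack([S[:5] + 0.01, np.full((5, 8), 10.0)])

pairs, searched = filtered_join(R, S, eps=1.0, tau=1,
                                predict_count=make_sample_estimator(S))
print(f"searched {searched} of {len(R)} queries, found {len(pairs)} pairs")
```

In this toy setup the far-away queries receive an estimate of zero and are never searched, so the join result is unchanged while half of the range searches are avoided. With a real learned filter, mispredictions trade some recall for speed, which is why the method is approximate.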

Paper Structure

This paper contains 22 sections, 2 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: Architecture and workflow of Xling
  • Figure 2: End-to-end query processing time and recall for all the similarity join methods on all datasets. The FastText and NUS-WIDE panels omit SuperEGO, which cannot run on these two datasets.
  • Figure 3: Speed-quality trade-off curves for XJoin, the approximate methods, and their Xling-enhanced versions on selected datasets and values of $\epsilon$; the other cases are similar.
  • Figure 4: Speed-quality trade-off curves for XJoin, the approximate methods and their Xling-enhanced versions on the second 150k datasets, where all Xlings are pre-trained on the original 150k dataset without re-training for the second

Theorems & Definitions (4)

  • Definition 1: range search
  • Definition 2: similarity join
  • Definition 3: Condition-based Regression problem
  • Definition 4: Training condition selection for CR