LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries

Ziqi Yin, Shanshan Feng, Shang Liu, Gao Cong, Yew Soon Ong, Bin Cui

TL;DR

LIST is a novel machine learning based Approximate Nearest Neighbor Search index that Learns to Index the Spatio-Textual data. It is paired with a lightweight embedding based spatial relevance model that can integrate with various text relevance models to rerank the objects retrieved by LIST.

Abstract

With the proliferation of spatio-textual data, Top-k KNN spatial keyword queries (TkQs), which return a list of objects based on a ranking function that considers both spatial and textual relevance, have found many real-life applications. To handle TkQs efficiently, many indexes have been developed, but the effectiveness of TkQs is limited. To improve effectiveness, several deep learning models have recently been proposed, but they suffer from severe efficiency issues, and there are no efficient indexes specifically designed to accelerate the top-k search process for these deep learning models. To tackle these issues, we consider embedding based spatial keyword queries, which capture the semantic meaning of query keywords and object descriptions in two separate embeddings to evaluate textual relevance. Although various models can be used to generate these embeddings, no indexes have been specifically designed for such queries. To fill this gap, we propose LIST, a novel machine learning based Approximate Nearest Neighbor Search index that Learns to Index the Spatio-Textual data. LIST utilizes a new learning-to-cluster technique to group relevant queries and objects together while separating irrelevant queries and objects. There are two key challenges in building an effective and efficient index: the absence of high-quality labels and unbalanced clustering results. We develop a novel pseudo-label generation technique to address both challenges. Additionally, we introduce a learning based spatial relevance model that can integrate with various text relevance models to form a lightweight yet effective relevance model for reranking objects retrieved by LIST.
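
To make the query setting concrete, the following is a minimal sketch of an embedding based spatial keyword query answered by brute force. It is not the LIST index or the learned spatial relevance model from the paper. The weighted-sum score, the linear distance decay, and the names `text_relevance`, `spatial_relevance`, `top_k`, `alpha`, and `max_dist` are all assumptions chosen for illustration.

```python
import heapq
import math

def text_relevance(q_emb, o_emb):
    """Dot product of the query embedding and the object embedding."""
    return sum(a * b for a, b in zip(q_emb, o_emb))

def spatial_relevance(q_loc, o_loc, max_dist):
    """Assumed linear decay with Euclidean distance, clipped to [0, 1]."""
    return max(0.0, 1.0 - math.dist(q_loc, o_loc) / max_dist)

def top_k(query, objects, k, alpha=0.5, max_dist=10.0):
    """Score every object (no index) and return the ids of the k best."""
    q_loc, q_emb = query
    scored = [
        (alpha * text_relevance(q_emb, emb)
         + (1 - alpha) * spatial_relevance(q_loc, loc, max_dist), oid)
        for oid, loc, emb in objects
    ]
    return [oid for _, oid in heapq.nlargest(k, scored)]

# Toy data (hypothetical): (object id, location, text embedding).
objects = [
    ("cafe_near", (0.0, 1.0), (0.9, 0.1)),
    ("cafe_far", (6.0, 8.0), (1.0, 0.0)),
    ("gym_near", (0.5, 0.5), (0.2, 0.8)),
]
query = ((0.0, 0.0), (1.0, 0.0))  # (location, text embedding)
result = top_k(query, objects, k=2)
```

With the default weights, the nearby cafe ranks first because it scores well on both components. The scan visits every object for every query, which is the cost that an approximate index such as LIST is intended to avoid.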

Paper Structure

This paper contains 18 sections, 17 equations, 12 figures, 6 tables, and 1 algorithm.

Figures (12)

  • Figure 1: The term-match panel shows the percentage distribution of ground-truth positive query-object pairs based on the number of matching terms on the Beijing dataset. The CDF panel compares the CDF of spatial distance for ground-truth positive query-object pairs with the linear distribution on the Beijing dataset.
  • Figure 2: The three phases of our retriever LIST: the training, indexing, and query phase. The relevance model is shown in yellow and the index is shown in green.
  • Figure 3: The illustration of the index.
  • Figure 4: The effectiveness-efficiency trade-off results (up and right is better).
  • Figure 5: The effectiveness-speed trade-off curve as the number of objects retrieved (top-$k$) and the number of clusters to route ($cr$) vary (up and right is better).
  • ...and 7 more figures