Discovering Data Structures: Nearest Neighbor Search and Beyond

Omar Salemohamed, Laurent Charlin, Shivam Garg, Vatsal Sharan, Gregory Valiant

TL;DR

This work proposes a general framework for end-to-end learning of data structures. The framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity; the authors apply it to nearest neighbor search.

Abstract

We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures/algorithms. We first apply this framework to the problem of nearest neighbor search. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, they have elements of locality-sensitive hashing. The model can also learn useful representations of high-dimensional data and exploit them to design effective data structures. We also adapt our framework to the problem of estimating frequencies over a data stream, and believe it could also be a powerful discovery tool for new problems.
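
For readers unfamiliar with the two classical 1D baselines named in the abstract, the following is a minimal textbook sketch of nearest neighbor search over a sorted array using binary search and a simple interpolation search. It is not the learned model from the paper; the function names and the lookup-counting convention are assumptions made for illustration, and the input is assumed to be a non-empty sorted list.

```python
from typing import List, Tuple


def _closer(sorted_vals: List[float], q: float, lo: int, hi: int) -> int:
    """Return whichever of the indices lo, hi holds the value closer to q."""
    return min((lo, hi), key=lambda i: abs(sorted_vals[i] - q))


def nn_binary_search(sorted_vals: List[float], q: float) -> Tuple[int, int]:
    """Return (index of the nearest neighbor of q, number of array lookups)."""
    lo, hi = 0, len(sorted_vals) - 1
    lookups = 0
    # Invariant: the nearest neighbor lies in sorted_vals[lo..hi].
    while hi - lo > 1:
        mid = (lo + hi) // 2
        lookups += 1
        if sorted_vals[mid] <= q:
            lo = mid
        else:
            hi = mid
    lookups += len({lo, hi})  # final comparison of the remaining candidates
    return _closer(sorted_vals, q, lo, hi), lookups


def nn_interpolation_search(sorted_vals: List[float], q: float) -> Tuple[int, int]:
    """Same contract as nn_binary_search, but each probe position is guessed
    from the query value, assuming values are spread roughly evenly."""
    lo, hi = 0, len(sorted_vals) - 1
    v_lo, v_hi = sorted_vals[lo], sorted_vals[hi]
    lookups = 2  # the two endpoint reads above
    if q <= v_lo:
        return lo, lookups
    if q >= v_hi:
        return hi, lookups
    # Invariant: v_lo <= q < v_hi, so the denominator below is positive.
    while hi - lo > 1:
        frac = (q - v_lo) / (v_hi - v_lo)
        mid = lo + int(frac * (hi - lo))
        mid = min(max(mid, lo + 1), hi - 1)  # force progress on every step
        v_mid = sorted_vals[mid]
        lookups += 1
        if v_mid <= q:
            lo, v_lo = mid, v_mid
        else:
            hi, v_hi = mid, v_mid
    return (lo if q - v_lo <= v_hi - q else hi), lookups


if __name__ == "__main__":
    data = sorted([-0.9, -0.5, -0.1, 0.2, 0.6, 0.95])
    print(nn_binary_search(data, 0.25))
    print(nn_interpolation_search(data, 0.25))
```

Binary search ignores the query's value when choosing where to probe, so its lookup count depends only on the array length. Interpolation search uses the value to guess a position, which tends to help when the data are close to uniformly distributed and can hurt when they are not.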

Paper Structure

This paper contains 53 sections, 25 figures.

Figures (25)

  • Figure 1: Our model has two components: 1) A data-processing network that transforms raw data into structured data, arranging it for efficient querying and generating additional statistics when given extra space (not shown in the figure). 2) A query-execution network that performs $M$ lookups into the output of the data-processing network in order to retrieve the answer to some query $q$. Each lookup $i$ is managed by a separate query model $Q^i$, which takes $q$ and the lookup history $H_i$, and outputs a one-hot lookup vector $m_i$ indicating the position to query.
  • Figure 2: (Left) Our model (E2E) trained with 1D data from the uniform distribution over $(-1, 1)$ outperforms binary search and several ablations. (Center) Distribution of lookups by the first query model. Unlike binary search, the model does not always start in the middle but rather closer to the query's likely position in the sorted data. (Right) When trained on data from a "hard" distribution for which the query value does not reveal information about the query's relative position, the model finds a solution similar to binary search. The figure shows an example of the model performing binary search ('X' denotes the nearest neighbor location).
  • Figure 3: For a 1D Zipfian query distribution, our model performs slightly better than the learning-augmented treap algorithm from hsu2018learningbased, and both methods significantly outperform binary search.
  • Figure 4: Our model's learned data structure for an instance from the uniform distribution in 2D. While the original order of the stored points showed no structure, the learned data structure arranges points that are close together in the Euclidean plane next to each other.
  • Figure 5: The learned data structure resembles a k-d tree in 2D. We show the average pairwise distances (along the first, second, and both dimensions) between points for the learned structure and the k-d tree, with darker colors indicating smaller distances. For the k-d tree, we arrange the points by in-order traversal. The k-d tree recursively splits the points into two groups based on whether their value is smaller or larger than the median along a given dimension, alternating between dimensions at each level, starting with dimension 1. The learned data structure approximately mirrors this pattern, splitting by dimension 2 followed by dimension 1.
  • ...and 20 more figures
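
The k-d tree baseline described in the Figure 5 caption (median split along alternating dimensions, points arranged by in-order traversal) can be sketched as below. This is a generic illustration under stated assumptions, not the authors' code: the function name `kd_inorder_layout` is hypothetical, ties at the median are broken by sort order, and dimensions are zero-indexed, so the caption's "dimension 1" corresponds to index 0.

```python
from typing import List, Sequence

Point = Sequence[float]


def kd_inorder_layout(points: List[Point], depth: int = 0) -> List[Point]:
    """Arrange points in the order of an in-order traversal of a k-d tree.

    At each level the points are split at the median along one dimension,
    cycling through dimensions as the depth increases.
    """
    if len(points) <= 1:
        return list(points)
    dim = depth % len(points[0])                 # alternate split dimension
    pts = sorted(points, key=lambda p: p[dim])   # order along that dimension
    mid = len(pts) // 2                          # median point is the node
    left = kd_inorder_layout(pts[:mid], depth + 1)
    right = kd_inorder_layout(pts[mid + 1:], depth + 1)
    # In-order traversal: left subtree, then the node, then right subtree.
    return left + [pts[mid]] + right


if __name__ == "__main__":
    pts = [(0.1, 0.9), (0.8, 0.2), (0.5, 0.5), (0.3, 0.1), (0.9, 0.7)]
    for p in kd_inorder_layout(pts):
        print(p)
```

In the resulting one-dimensional arrangement, points that fall in the same subtree sit next to each other, which is the block pattern the pairwise-distance plots in Figure 5 are described as showing.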