PolyMinHash: Efficient Area-Based MinHashing of Polygons for Approximate Nearest Neighbor Search

Alima Subedi, Sankalpa Pokharel, Satish Puri

TL;DR

To address scalable polygon similarity search, the paper introduces PolyMinHash, a 2D polygon hashing scheme that yields short, similarity-preserving signatures. Each hash value is the number of random sample points drawn before one lands inside the polygon, and two polygons receive the same hash value with probability equal to their area-based Jaccard similarity $J(P,Q)$. The method centers polygons, builds a global bounding rectangle, and uses fixed seeds to generate $m$ samples per polygon, combining fast rejection tests with a final point-in-polygon check. Experiments on real-world GIS-style datasets show that PolyMinHash reduces the number of refinement candidates by up to 98% relative to brute force while maintaining competitive recall. Overall, PolyMinHash extends MinHash to 2D polygon geometry, offering memory-efficient signatures and effective pruning for large-scale polygon similarity search.

Abstract

Similarity searches are a critical task in data mining. As data sets grow larger, exact nearest neighbor searches quickly become infeasible, leading to the adoption of approximate nearest neighbor (ANN) searches. ANN has been studied for text data, images, and trajectories. However, there has been little effort to develop ANN systems for polygons in spatial database systems and geographic information systems. We present PolyMinHash, a system for approximate polygon similarity search that adapts MinHashing into a novel 2D polygon-hashing scheme to generate short, similarity-preserving signatures of input polygons. Each MinHash value is generated by counting the number of randomly sampled points needed before a sampled point lands within the polygon's interior, yielding hash values that preserve area-based Jaccard similarity. We present the tradeoff between search accuracy and runtime of our PolyMinHash system. Our hashing mechanism reduces the number of candidates to be processed in the query refinement phase by up to 98% compared to the number of candidates processed by the brute-force algorithm.

Paper Structure

This paper contains 12 sections, 2 theorems, 9 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

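For any two polygons $P$ and $Q$, the probability of a hash collision equals the Jaccard similarity between the two polygons:

$$\Pr[\mathrm{MinHash}(P) = \mathrm{MinHash}(Q)] = J(P, Q),$$

where $J(P, Q) = \mathrm{Area}(P \cap Q) / \mathrm{Area}(P \cup Q)$ denotes the area-based Jaccard similarity.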
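
One way to see why a collision probability of this form arises is the standard MinHash-style argument sketched below. It is not necessarily the paper's own proof. It assumes that both polygons are hashed against the same seeded sequence of independent, uniformly distributed points $x_1, x_2, \dots$ in the global bounding rectangle, that both polygons lie inside that rectangle, and that $\mathrm{Area}(P \cup Q) > 0$. The symbols $x_i$ and $T$ are introduced here for illustration only.

$$
\begin{aligned}
T &= \min\{\, i : x_i \in P \cup Q \,\}, \\
\mathrm{MinHash}(P) = \mathrm{MinHash}(Q) &\iff x_T \in P \cap Q, \\
\Pr[x_T \in P \cap Q] &= \frac{\mathrm{Area}(P \cap Q)}{\mathrm{Area}(P \cup Q)} = J(P, Q).
\end{aligned}
$$

The first sample to land in $P \cup Q$ is uniformly distributed over that region. If it lies in both polygons, both counts stop at $T$ and the hash values agree. If it lies in only one of them, that polygon's count stops at $T$ while the other keeps counting, so the values differ.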

Figures (4)

  • Figure 1: An example of a K-Approximate nearest-neighbor (K-ANN) query and its results for K=3. JD is the Jaccard distance.
  • Figure 2: Overview of PolyMinHash system. (a) Center the polygons. (b) Construct the global MBR B and sample 2D points randomly within B for Polygon P. The red dot #6 is the first point that lands inside the polygon. Therefore, the MinHash(P) is 6.
  • Figure 3: Effect of MinHash length on MinHashing time, Refinement time, and Recall for Cemetery dataset.
  • Figure 4: (a) Relation between Recall and Pruning. (b) Effect of MinHash Length on Pruning.
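
The procedure described in the Figure 2 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name is hypothetical, centering is assumed to mean translating each polygon so the center of its bounding box sits at the origin, one seed is assumed per signature position, and the `max_tries` cap is an added safeguard that the source does not describe.

```python
import random


def bbox(poly):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a vertex list."""
    xs = [p[0] for p in poly]
    ys = [p[1] for p in poly]
    return min(xs), min(ys), max(xs), max(ys)


def center(poly):
    """Translate the polygon so its bounding-box center is at the origin (assumed rule)."""
    x0, y0, x1, y1 = bbox(poly)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    return [(x - cx, y - cy) for x, y in poly]


def global_mbr(polys):
    """Smallest axis-aligned rectangle that contains every polygon."""
    boxes = [bbox(p) for p in polys]
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def point_in_polygon(x, y, poly):
    """Ray-casting test: toggle on each edge crossed by a ray going right from (x, y)."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def minhash(poly, mbr, seed, max_tries=100_000):
    """Count the samples drawn in the global MBR until one lands inside the polygon."""
    rng = random.Random(seed)  # fixed seed: every polygon sees the same point sequence
    bx0, by0, bx1, by1 = mbr
    px0, py0, px1, py1 = bbox(poly)
    for count in range(1, max_tries + 1):
        x = rng.uniform(bx0, bx1)
        y = rng.uniform(by0, by1)
        # Fast rejection: skip the exact test when the point misses the polygon's box.
        if not (px0 <= x <= px1 and py0 <= y <= py1):
            continue
        if point_in_polygon(x, y, poly):
            return count
    return max_tries + 1  # sentinel (assumption): no hit within the cap


def signature(poly, mbr, m):
    """Signature of length m, using seeds 0..m-1 (assumed seeding scheme)."""
    return [minhash(poly, mbr, seed) for seed in range(m)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two polygons collide."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Two rectangles: after centering, Q lies inside P and covers half of its area.
P = center([(10, 10), (14, 10), (14, 12), (10, 12)])
Q = center([(0, 0), (2, 0), (2, 2), (0, 2)])
B = global_mbr([P, Q])
estimate = estimated_jaccard(signature(P, B, 400), signature(Q, B, 400))
print(f"estimated Jaccard: {estimate:.3f}")
```

For the two rectangles in the usage lines, the exact area-based Jaccard similarity is 0.5, so the printed estimate should fall near that value and tighten as the signature length grows. Because each seed always produces the same point sequence, signatures for different polygons are directly comparable position by position.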

Theorems & Definitions (6)

  • Definition 1: Nearest Neighbor Search
  • Definition 2: MinHash Function
  • Theorem 1
  • proof
  • Definition 3
  • Theorem 2