Group Testing for Accurate and Efficient Range-Based Near Neighbor Search for Plagiarism Detection

Harsh Shah, Kashish Mittal, Ajit Rajwade

TL;DR

The high recall of the technique makes it particularly suited to plagiarism detection scenarios, where it is important to report every database item that is sufficiently similar to the query.

Abstract

This work presents an adaptive group testing framework for the range-based high-dimensional near neighbor search problem. Our method efficiently marks each item in a database as a neighbor or non-neighbor of a query point, based on a cosine distance threshold, without exhaustive search. Like other methods for large-scale retrieval, our approach exploits the assumption that most of the items in the database are unrelated to the query. However, it does not assume a large difference between the cosine similarity of the query vector with the least related neighbor and that with the least unrelated non-neighbor. Following a multi-stage adaptive group testing algorithm based on binary splitting, we divide the set of items to be searched in half at each step, and perform dot product tests on smaller and smaller subsets, many of which we are able to prune away. We show that, using softmax-based features, our method achieves a more than ten-fold speed-up over exhaustive search with no loss of accuracy, on a variety of large datasets. Based on empirically verified models for the distribution of cosine distances, we present a theoretical analysis of the expected number of distance computations per query and the probability that a pool will be pruned. Our method has the following features: (i) It implicitly exploits useful distributional properties of cosine distances, unlike other methods; (ii) All required data structures are created purely offline; (iii) It does not impose any strong assumptions on the number of true near neighbors; (iv) It is adaptable to streaming settings where new vectors are dynamically added to the database; and (v) It does not require any parameter tuning. The high recall of our technique makes it particularly suited to plagiarism detection scenarios, where it is important to report every database item that is sufficiently similar to the query.
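The binary-splitting idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on assumptions not spelled out above: feature vectors are non-negative (as softmax outputs are) and unit-norm, a pool is represented by the sum of its vectors, and the neighbor test is a dot product threshold `rho`. Under those assumptions, the dot product of the query with a pool sum is at least as large as its dot product with any single member, so a pool whose test falls below `rho` cannot contain a neighbor and can be pruned. The names `build_pool_sums` and `range_search` are hypothetical.

```python
import numpy as np


def build_pool_sums(X):
    """Offline step: store the sum of the vectors in every pool of a binary split.

    Pools are half-open index ranges (lo, hi) over the rows of X.
    """
    sums = {}

    def build(lo, hi):
        if hi - lo == 1:
            total = X[lo]
        else:
            mid = (lo + hi) // 2
            total = build(lo, mid) + build(mid, hi)
        sums[(lo, hi)] = total
        return total

    build(0, X.shape[0])
    return sums


def range_search(q, sums, n, rho):
    """Return indices i with q . X[i] >= rho, plus the number of dot products used.

    Assumes q and all database vectors are non-negative, so a pool whose
    summed dot product is below rho holds no neighbor.
    """
    found, num_tests = [], 0
    stack = [(0, n)]
    while stack:
        lo, hi = stack.pop()
        num_tests += 1
        if q @ sums[(lo, hi)] < rho:
            continue  # prune the whole pool with a single test
        if hi - lo == 1:
            found.append(lo)  # a single item that passed the threshold
        else:
            mid = (lo + hi) // 2
            stack.append((lo, mid))
            stack.append((mid, hi))
    return sorted(found), num_tests


# Toy data (an assumption for illustration): peaky non-negative unit vectors.
rng = np.random.default_rng(0)
X = rng.random((64, 16)) ** 8
X /= np.linalg.norm(X, axis=1, keepdims=True)

# Query: a slightly perturbed copy of database item 5.
q = X[5] + 0.01 * rng.random(16)
q /= np.linalg.norm(q)

sums = build_pool_sums(X)
found, num_tests = range_search(q, sums, X.shape[0], rho=0.9)
```

In the worst case this sketch tests every pool in the tree, which is `2n - 1` dot products for `n` items. Any saving over exhaustive search depends on how many pools are pruned early, which is the quantity the paper analyzes using the distribution of dot products.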

Paper Structure (14 sections, 6 equations, 8 figures, 5 tables, 1 algorithm)

Figures (8)

  • Figure 1: Normalized histograms (bar graphs in blue), and best-fit exponential distribution approximations (red curves), of dot products between feature vectors of query images (10,000 in number) and all images of MIRFLICKR, ImageNet, IMDB-Wiki and InstaCities databases (left to right, top to bottom). Also see Fig. 1 of suppmat for a plot of negative log of the histograms for clearer visualization of the smaller probability values.
  • Figure 2: Schematic of our GT algorithm for NN search
  • Figure 3: Negative logarithm of normalized histograms of dot products between feature vectors of query images (10K in number) and those of all gallery images of MIRFLICKR, ImageNet, IMDB-Wiki and InstaCities (left to right, top to bottom). Compare to Fig. 2 of the main paper.
  • Figure 4: Histograms of the number of pools pruned in every round of binary splitting for the Our-Sum method, for MIRFLICKR, ImageNet, IMDB-Wiki and InstaCities (left to right, top to bottom), all for $\rho \geq 0.7$.
  • Figure 5: Histograms of the number of pools pruned in every round of binary splitting for the Our-Max technique for MIRFLICKR, ImageNet, IMDB-Wiki and InstaCities (left to right, top to bottom), all for $\rho \geq 0.7$.
  • ...and 3 more figures