
On Train-Test Class Overlap and Detection for Image Retrieval

Chull Hwan Song, Jooyoung Yoon, Taebaek Hwang, Shunghyun Choi, Yeong Hyeon Gu, Yannis Avrithis

TL;DR

This work investigates train-test class overlap in image retrieval, demonstrating that overlapping landmarks between training and evaluation sets can artificially inflate performance. It introduces $\mathcal{R}$GLDv2-clean, a revisited training set with overlapped landmark categories removed to enable fair benchmarking against evaluation sets $\mathcal{R}$Oxford and $\mathcal{R}$Paris. Simultaneously, it proposes CiDeR, a single-stage, end-to-end detect-to-retrieve pipeline that forgoes location supervision by using attentional localization to isolate objects of interest and produce robust global descriptors. Across extensive experiments, CiDeR yields competitive or state-of-the-art results on existing clean datasets and significantly degrades when overlaps are removed, underscoring the importance of clean data; with fine-tuning, CiDeR sets new SOTA on the revisited clean dataset with and without distractors. Together, the data cleaning and the one-stage CiDeR approach offer a practical path to fair, scalable, and efficient instance-level image retrieval.

Abstract

How important is it for training and evaluation sets to not have class overlap in image retrieval? We revisit Google Landmarks v2 clean, the most popular training set, by identifying and removing class overlap with Revisited Oxford and Paris [34], the most popular evaluation set. By comparing the original and the new RGLDv2-clean on a benchmark of reproduced state-of-the-art methods, our findings are striking. Not only is there a dramatic drop in performance, but it is inconsistent across methods, changing the ranking. What does it take to focus on objects of interest and ignore background clutter when indexing? Do we need to train an object detector and the representation separately? Do we need location supervision? We introduce Single-stage Detect-to-Retrieve (CiDeR), an end-to-end, single-stage pipeline to detect objects of interest and extract a global image representation. We outperform previous state-of-the-art on both existing training sets and the new RGLDv2-clean. Our dataset is available at https://github.com/dealicious-inc/RGLDv2-clean.

Paper Structure (45 sections, 9 equations, 10 figures, 23 tables)

Figures (10)

  • Figure 1: It is beneficial for image retrieval to detect objects of interest in database images and only represent those. (a) Two-stage pipeline. Previous works involve two-stage embedding extraction at indexing, or a two-stage training process, and they may use location supervision or not. (b) One-stage pipeline. We use a single-stage embedding extraction at training and indexing; training is end-to-end and uses no location supervision.
  • Figure 2: Confirming overlapping landmark categories between training sets (GLDv2-clean, NC-clean, SfM-120k) and evaluation sets ($\mathcal{R}$Oxford, $\mathcal{R}$Paris). Red box: query image. The query image from the evaluation set in each box/row is followed by top-5 most similar images from the training set. Pink box: training image landmark identical with query (evaluation) image landmark. More examples can be found in the Appendix.
  • Figure 3: Ranking and verification pipeline to remove landmark categories from GLDv2-clean that overlap with those of the $\mathcal{R}$Oxf and $\mathcal{R}$Par evaluation sets and obtain the revisited version, $\mathcal{R}$GLDv2-clean.
  • Figure 4: Attentional localization (AL). Given a feature tensor $\mathbf{F} \in \mathbb{R}^{w \times h \times d}$, we obtain a spatial attention map $A \in \mathbb{R}^{w \times h}$ (\ref{eq:attn}) and we apply multiple thresholding operations to obtain a sequence of masks $M_1, \dots, M_T$ (\ref{eq:mask}). The masks are applied independently to $\mathbf{F}$ and the resulting tensors are fused into a single tensor $\mathbf{F}^\ell$ by a convex combination with learnable weights $w_1, \dots, w_T$ (\ref{eq:alm}).
  • Figure 5: Attentional localization (AL). (a) Spatial attention map $A$ (\ref{eq:attn}) learned on frozen ResNet101, as pre-trained on ImageNet. (b) Same, but with the network fine-tuned on $\mathcal{R}$GLDv2-clean. (c) Binary mask $M_i$ (\ref{eq:mask}) for $i=2$, with $\beta = 0$ for visualization. (d) Detected regions as bounding boxes of connected components of $M_i$, overlaid on input image (in blue).
  • ...and 5 more figures
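
The attentional localization step described in the Figure 4 caption can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The caption does not say how the attention map $A$ is computed, so the channel projection `proj`, the softplus, and the min-max normalization are assumptions, as is the softmax over the hypothetical `logits` that keeps the weights $w_1, \dots, w_T$ convex. The function name, the threshold values, and the use of $\beta$ as the mask value below the threshold (suggested by the Figure 5 caption) are likewise illustrative.

```python
import numpy as np


def attentional_localization(F, proj, thresholds, logits, beta=0.0):
    """Illustrative sketch of attentional localization (names are hypothetical).

    F          : (w, h, d) feature tensor.
    proj       : (d,) channel projection producing one score per location (assumed).
    thresholds : T threshold values, one per mask.
    logits     : T unnormalized fusion weights (learnable in a real model).
    beta       : mask value assigned to locations below the threshold.
    """
    # Spatial attention map A in [0, 1]: projection, softplus, min-max scaling.
    # This particular form is an assumption; the caption only states A is w x h.
    scores = np.logaddexp(0.0, F @ proj)                  # softplus, shape (w, h)
    A = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)

    # One mask per threshold: 1 where attention reaches the threshold, beta elsewhere.
    masks = [np.where(A >= t, 1.0, beta) for t in thresholds]

    # Softmax turns the logits into non-negative weights that sum to 1,
    # so the fusion below is a convex combination.
    logits = np.asarray(logits, dtype=float)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Apply each mask to F independently, then fuse into a single tensor.
    F_loc = sum(w_i * M[..., None] * F for w_i, M in zip(weights, masks))
    return A, masks, F_loc
```

With all thresholds at 0 every mask is all ones and the fused tensor equals the input, which is a convenient sanity check; raising the thresholds progressively suppresses low-attention locations toward `beta` times their original value.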