On Train-Test Class Overlap and Detection for Image Retrieval
Chull Hwan Song, Jooyoung Yoon, Taebaek Hwang, Shunghyun Choi, Yeong Hyeon Gu, Yannis Avrithis
TL;DR
This work investigates train-test class overlap in image retrieval, showing that landmarks shared between training and evaluation sets can artificially inflate performance. It introduces $\mathcal{R}$GLDv2-clean, a revisited training set with overlapping landmark categories removed, to enable fair benchmarking on the evaluation sets $\mathcal{R}$Oxford and $\mathcal{R}$Paris. It also proposes CiDeR, a single-stage, end-to-end detect-to-retrieve pipeline that needs no location supervision: it uses attentional localization to isolate objects of interest and produce robust global descriptors. In experiments, CiDeR yields competitive or state-of-the-art results on existing clean datasets, and its performance drops markedly once the overlapping classes are removed, underscoring the importance of clean data. With fine-tuning, CiDeR sets a new state of the art on the revisited clean dataset, both with and without distractors. Together, the data cleaning and the single-stage CiDeR approach offer a practical path to fair, scalable, and efficient instance-level image retrieval.
Abstract
How important is it for training and evaluation sets to not have class overlap in image retrieval? We revisit Google Landmarks v2 clean, the most popular training set, by identifying and removing class overlap with Revisited Oxford and Paris [34], the most popular evaluation set. By comparing the original and the new RGLDv2-clean on a benchmark of reproduced state-of-the-art methods, our findings are striking. Not only is there a dramatic drop in performance, but it is inconsistent across methods, changing the ranking. What does it take to focus on objects of interest and ignore background clutter when indexing? Do we need to train an object detector and the representation separately? Do we need location supervision? We introduce Single-stage Detect-to-Retrieve (CiDeR), an end-to-end, single-stage pipeline to detect objects of interest and extract a global image representation. We outperform the previous state of the art on both existing training sets and the new RGLDv2-clean. Our dataset is available at https://github.com/dealicious-inc/RGLDv2-clean.
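
To make the cleaning step concrete, here is a minimal sketch of removing overlapping classes from a training set. It assumes the set of overlapping class IDs has already been identified; how the overlap is detected is not described in this abstract. The function name, the `(image, class_id)` record layout, and the toy data are all hypothetical and are not taken from the released dataset or code.

```python
def remove_overlapping_classes(train_records, overlapping_classes):
    """Return the training records whose class is not flagged as overlapping.

    train_records: iterable of (image_name, class_id) pairs (assumed layout).
    overlapping_classes: class IDs that also appear in the evaluation set
        (assumed to be given; detecting them is out of scope here).
    """
    overlapping = set(overlapping_classes)
    return [(img, cls) for img, cls in train_records if cls not in overlapping]


# Hypothetical toy training set: four images over three landmark classes.
train = [("a.jpg", 1), ("b.jpg", 2), ("c.jpg", 2), ("d.jpg", 3)]

# Suppose class 2 depicts a landmark that also appears in the evaluation set.
overlap = {2}

clean = remove_overlapping_classes(train, overlap)
removed_images = len(train) - len(clean)
```

In this sketch the whole class is dropped, not just individual images, which matches the abstract's description of removing class overlap rather than filtering near-duplicate images.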
