Class Agnostic Instance-level Descriptor for Visual Instance Search
Qi-Ying Sun, Wan-Lei Zhao, Hui-Ying Xie, Yi-Bo Miao, Chong-Wah Ngo
TL;DR
CLAID addresses instance search by building a class-agnostic, multi-granularity descriptor from self-supervised ViT features. A top-down hierarchical bisecting clustering, with adaptive termination and dummy-node filtering, yields about 30 instance regions per image; RoI pooling then turns each region into a fixed-length feature, giving a uniform, class-agnostic feature set for image and instance retrieval. The key contributions are the hierarchical detector, the termination and dummy-node mechanisms, and experiments on multiple benchmarks that show strong performance on instance search and image retrieval, with the method remaining compatible with other backbones, including SigLIP. The approach is scalable and localizes instances robustly under occlusion and for unknown categories, which allows a single search system to cover text-to-instance search, instance search, and image retrieval.
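To make the top-down bisecting idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes patch features arrive as a NumPy array of shape (n_patches, dim). The helper names (`two_means`, `cohesion`, `bisect_hierarchy`, `region_descriptors`), the cohesion threshold `tau`, and the `min_size` stopping rule are all hypothetical placeholders for the adaptive termination described above. Dummy-node filtering is omitted because its criterion is not stated here, and mean pooling over member patches stands in for RoI pooling.

```python
import numpy as np

def two_means(x, iters=10):
    """Split the rows of x into two groups with a small deterministic 2-means."""
    # Seed with the point farthest from the mean, then the point farthest from that seed.
    c0 = x[np.argmax(np.linalg.norm(x - x.mean(0), axis=1))]
    c1 = x[np.argmax(np.linalg.norm(x - c0, axis=1))]
    centers = np.stack([c0, c1]).astype(float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centers[None], axis=2)
        labels = dists.argmin(1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(0)
    return labels

def cohesion(x):
    """Mean cosine similarity between each row and the group centroid."""
    c = x.mean(0)
    xn = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    cn = c / (np.linalg.norm(c) + 1e-8)
    return float((xn @ cn).mean())

def bisect_hierarchy(feats, idx=None, tau=0.9, min_size=4):
    """Return patch-index arrays for every node (root, internal, leaf)."""
    if idx is None:
        idx = np.arange(len(feats))
    nodes = [idx]
    # Assumed termination rule: stop when the subset is compact or too small to split.
    if len(idx) < 2 * min_size or cohesion(feats[idx]) >= tau:
        return nodes
    labels = two_means(feats[idx])
    left, right = idx[labels == 0], idx[labels == 1]
    if len(left) < min_size or len(right) < min_size:
        return nodes
    return (nodes
            + bisect_hierarchy(feats, left, tau, min_size)
            + bisect_hierarchy(feats, right, tau, min_size))

def region_descriptors(feats, nodes):
    """One fixed-length, L2-normalized vector per node (mean pooling as a stand-in)."""
    out = np.stack([feats[n].mean(0) for n in nodes])
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-8)
```

Because every node of the hierarchy is kept, not only the leaves, the resulting descriptor set contains coarse regions alongside fine ones, which is the multi-granularity property the summary refers to.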
Abstract
Despite the great success of deep features in content-based image retrieval, visual instance search remains challenging due to the lack of an effective instance-level feature representation. Supervised and weakly supervised object detection methods are not appropriate solutions because they perform poorly on unknown object categories. In this paper, starting from the feature set output by a self-supervised ViT, instance-level region discovery is modeled as detecting compact feature subsets in a hierarchical fashion. This hierarchical decomposition produces a hierarchy of instance regions. On the one hand, it addresses the problems of object embedding and occlusion, which are widely observed in real scenarios. On the other hand, the non-leaf and leaf nodes of the hierarchy correspond to instance regions of different granularities within an image. Features of uniform length are therefore produced for these instance regions, each of which may cover a dominant image region, a group of multiple instances, or an individual instance. Such a collection of features allows us to unify image retrieval, multi-instance search, and instance search in one framework. Empirical studies on three benchmarks show that this instance-level descriptor remains effective on both known and unknown object categories. Moreover, superior performance is achieved on single-instance search, multi-instance search, and image retrieval.
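One way to picture how a set of uniform-length region features can serve image retrieval and instance search alike is sketched below. This is an illustration under assumptions, not the paper's matching scheme: it assumes each database image is stored as an array of region descriptors and scores an image by the best cosine similarity between the query vector and any of its regions. The function names `l2_normalize` and `rank_images` and the dictionary layout are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def rank_images(query, database):
    """Rank images by their best-matching region.

    database maps image_id -> array of shape (n_regions, dim).
    Returns (image_id, score, best_region_index) tuples, best first.
    """
    q = l2_normalize(np.asarray(query, dtype=float))
    results = []
    for image_id, regions in database.items():
        sims = l2_normalize(np.asarray(regions, dtype=float)) @ q
        best = int(sims.argmax())
        results.append((image_id, float(sims[best]), best))
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Under this assumed scheme, a query describing a whole image would tend to match a coarse, near-root region, while a query describing a single object would match a fine-grained one, and the index of the best region gives a rough localization.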
