Class Agnostic Instance-level Descriptor for Visual Instance Search

Qi-Ying Sun, Wan-Lei Zhao, Hui-Ying Xie, Yi-Bo Miao, Chong-Wah Ngo

TL;DR

CLAID tackles the instance search problem by building a class-agnostic, multi-granularity descriptor from self-supervised ViT features. It uses a top-down hierarchical bisecting clustering with adaptive termination and dummy-node filtering to produce about 30 region-level descriptors per image, which are then pooled via RoI pooling to form a uniform, class-agnostic feature set for image and instance retrieval. The key contributions are the hierarchical detector, the termination and dummy-node mechanisms, and extensive experiments showing strong performance on instance search and image retrieval across multiple benchmarks, with backbone-agnostic compatibility including SigLIP. This approach delivers scalable, robust instance localization under occlusion and unknown categories, enabling integrated search systems that span text-to-instance search, instance search, and image retrieval.

Abstract

Despite the great success of deep features in content-based image retrieval, visual instance search remains challenging due to the lack of an effective instance-level feature representation. Supervised and weakly supervised object detection methods are not appropriate solutions because they perform poorly on unknown object categories. In this paper, based on the feature set output by a self-supervised ViT, instance-level region discovery is modeled as detecting compact feature subsets in a hierarchical fashion. The hierarchical decomposition results in a hierarchy of instance regions. On the one hand, this kind of decomposition addresses the problems of object embedding and occlusion, which are widely observed in real scenarios. On the other hand, the non-leaf and leaf nodes of the hierarchy correspond to instance regions at different granularities within an image. Features of uniform length are therefore produced for these instance regions, which may cover a dominant image region, a composite of multiple instances, or various individual instances. Such a collection of features allows us to unify image retrieval, multi-instance search, and instance search in one framework. Empirical studies on three benchmarks show that such an instance-level descriptor remains effective on both known and unknown object categories. Moreover, superior performance is achieved on single-instance and multi-instance search, as well as on image retrieval tasks.
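
The hierarchical decomposition described in the abstract can be illustrated with a minimal sketch, given here only as an aid to intuition and not as the authors' implementation. It recursively bisects a set of patch features with plain 2-means and records every node of the resulting hierarchy. The names `two_means`, `spread`, `bisect_hierarchy`, and `collect_nodes`, the farthest-point seeding, and the stopping parameters `tau`, `min_size`, and `max_depth` are all assumptions made for illustration; the paper's own seed selection, adaptive termination, and dummy-node filtering are not reproduced.

```python
import numpy as np

def two_means(feats, iters=10, seed=0):
    """Split the rows of `feats` into two clusters with plain 2-means."""
    rng = np.random.default_rng(seed)
    # Illustrative seeding (an assumption): a random point and the point farthest from it.
    c0 = feats[rng.integers(len(feats))]
    c1 = feats[np.argmax(((feats - c0) ** 2).sum(1))]
    centers = np.stack([c0, c1]).astype(float)
    for _ in range(iters):
        dists = ((feats[:, None, :] - centers[None]) ** 2).sum(-1)  # shape (n, 2)
        labels = dists.argmin(1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(0)
    return labels

def spread(feats):
    """Mean squared distance of the rows to their centroid (a compactness proxy)."""
    return float(((feats - feats.mean(0)) ** 2).sum(1).mean())

def bisect_hierarchy(feats, idx=None, depth=0, max_depth=4, min_size=4, tau=0.05):
    """Recursively bisect patch features; each node stores the patch indices it covers."""
    if idx is None:
        idx = np.arange(len(feats))
    node = {"idx": idx, "depth": depth, "children": []}
    sub = feats[idx]
    # Hypothetical termination rule: stop on depth, size, or compactness.
    if depth >= max_depth or len(idx) < 2 * min_size or spread(sub) < tau:
        return node
    labels = two_means(sub)
    left, right = idx[labels == 0], idx[labels == 1]
    if len(left) < min_size or len(right) < min_size:
        return node
    node["children"] = [
        bisect_hierarchy(feats, part, depth + 1, max_depth, min_size, tau)
        for part in (left, right)
    ]
    return node

def collect_nodes(node):
    """Flatten the hierarchy: non-leaf and leaf nodes are all candidate regions."""
    out = [node]
    for child in node["children"]:
        out.extend(collect_nodes(child))
    return out
```

In this sketch every node, leaf or non-leaf, is kept as a candidate region, which mirrors the idea that coarse nodes cover dominant regions or groups of instances while deeper nodes cover individual instances.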

Paper Structure

This paper contains 18 sections, 10 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: Semantic region decomposition in the proposed Class Agnostic Instance-level Descriptor (CLAID). It is a multi-level and multi-granularity instance decomposition, leading to excellent performance on three instance search benchmarks.
  • Figure 2: The process diagram showcases different initialization methods for the bisecting clustering. The top row illustrates the clustering with seed selection. The first column shows the original image, followed by the images after the initialization, intermediate clustering result, and the final clustering results. The bottom row shows the clustering under random initialization.
  • Figure 3: An illustration of detecting dense objects of different sizes and shapes with CLAID. Different instances are shown in different colors. For clarity, the instances corresponding to non-leaf nodes are not shown in figure (b).
  • Figure 4: The illustration of visual instances under the "dummy node" in a real scenario. A region (in blue) in Figure (c) is detected as a "dummy node" based on Eqn. \ref{eq:salience} after the 3rd bisecting. Two latent instances are detected when we further decompose the "dummy node" at the 5th bisecting. Figure (b) visualizes the high energy regions, i.e., $H$ (in red color) in the image.
  • Figure 5: The framework for building instance-level features for instance search. There are two major components: an instance-level region detector and a feature descriptor. Given an image, a set of patch-level features is produced by the self-supervised backbone. The hierarchical decomposition is applied to the feature set. Each node on the hierarchy corresponds to a potential instance region or sub-region. The patch-level masks can be produced for the valid nodes on the hierarchy. The feature for each region is pooled from another network. Given an incoming query, a feature is extracted with the same backbone.
  • ...and 4 more figures
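
The descriptor and query stages outlined in the Figure 5 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's pipeline: masked average pooling over patch features stands in for the RoI pooling mentioned in the TL;DR, and an image is scored by its best-matching region under cosine similarity, which is an assumed scoring rule. The function names `region_descriptor` and `rank_images` are hypothetical.

```python
import numpy as np

def region_descriptor(patch_feats, patch_idx):
    """Average the features of the patches in a region and L2-normalise the result.

    Stand-in for RoI pooling (an assumption): every region yields a vector of the
    same length regardless of how many patches it covers.
    """
    v = patch_feats[patch_idx].mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

def rank_images(query_vec, image_region_descs):
    """Rank images by the cosine similarity of their best-matching region.

    `image_region_descs` maps an image id to an (n_regions, dim) array of
    L2-normalised region descriptors.
    """
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    scores = {img: float((descs @ q).max()) for img, descs in image_region_descs.items()}
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores
```

Because each image contributes descriptors at several granularities, the same index could in principle serve whole-image queries (matched by coarse nodes) and single-instance queries (matched by fine nodes).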