ConceptHash: Interpretable Fine-Grained Hashing via Concept Discovery

Kam Woh Ng, Xiatian Zhu, Yi-Zhe Song, Tao Xiang

TL;DR

ConceptHash makes fine-grained hashing interpretable by introducing $M$ learnable concept tokens as visual prompts in a Vision Transformer, yielding sub-codes that each map to a human-understandable concept. Language guidance from CLIP is used to form semantic class centers $o_c$, aligning hash codes with textual class representations so that codes are similar within a family and distinct across families. The learning objective combines discriminative, quantization, and regularization terms to encourage diverse, semantically coherent concept representations, with adapters enabling efficient fine-tuning. Empirically, ConceptHash achieves state-of-the-art results on four fine-grained image retrieval benchmarks and provides explicit interpretability at the sub-code level, offering a practical path toward more transparent and controllable retrieval systems.

Abstract

Existing fine-grained hashing methods typically lack code interpretability as they compute hash code bits holistically using both global and local features. To address this limitation, we propose ConceptHash, a novel method that achieves sub-code level interpretability. In ConceptHash, each sub-code corresponds to a human-understandable concept, such as an object part, and these concepts are automatically discovered without human annotations. Specifically, we leverage a Vision Transformer architecture and introduce concept tokens as visual prompts, along with image patch tokens as model inputs. Each concept is then mapped to a specific sub-code at the model output, providing natural sub-code interpretability. To capture subtle visual differences among highly similar sub-categories (e.g., bird species), we incorporate language guidance to ensure that the learned hash codes are distinguishable within fine-grained object classes while maintaining semantic alignment. This approach allows us to develop hash codes that exhibit similarity within families of species while remaining distinct from species in other families. Extensive experiments on four fine-grained image retrieval benchmarks demonstrate that ConceptHash outperforms previous methods by a significant margin, offering unique sub-code interpretability as an additional benefit. Code at: https://github.com/kamwoh/concepthash.
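
To make the concept-to-sub-code mapping concrete, here is a minimal sketch in plain Python; it is not the authors' implementation. It assumes that each concept token's output embedding is passed through its own linear head and binarized by sign, and that bits are written as 0/1. The names `project`, `sign_bits`, `concept_hash`, and `subcode_hamming`, the toy 2-dimensional embeddings, and the identity heads are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def project(token: Vector, head: Matrix) -> List[float]:
    """Apply a linear head (one row per output bit) to one concept-token embedding."""
    return [sum(w * x for w, x in zip(row, token)) for row in head]


def sign_bits(values: Vector) -> str:
    """Binarize real-valued outputs: non-negative -> '1', negative -> '0'."""
    return "".join("1" if v >= 0 else "0" for v in values)


def concept_hash(concept_tokens: Sequence[Vector], heads: Sequence[Matrix]) -> List[str]:
    """Return one sub-code per concept token (assumed: one head per concept)."""
    return [sign_bits(project(tok, head)) for tok, head in zip(concept_tokens, heads)]


def subcode_hamming(a: Sequence[str], b: Sequence[str]) -> List[int]:
    """Hamming distance computed separately for each pair of sub-codes."""
    return [sum(x != y for x, y in zip(sa, sb)) for sa, sb in zip(a, b)]


# Toy setting: M = 3 concepts, 2 bits each, so the full code has 6 bits.
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [identity, identity, identity]

# Hypothetical concept-token outputs for two images.
tokens_a = [[-0.5, -1.0], [0.3, 0.8], [0.9, -0.2]]
tokens_b = [[-0.4, -0.7], [-0.6, 0.5], [0.7, -0.1]]

codes_a = concept_hash(tokens_a, heads)
codes_b = concept_hash(tokens_b, heads)

full_code_a = "".join(codes_a)  # sub-codes concatenated into the full hash code
per_concept_distance = subcode_hamming(codes_a, codes_b)
```

Because the distance is computed per sub-code, a mismatch can be attributed to a single concept (here, only the second concept differs between the two toy images), which is the kind of sub-code level reading the abstract describes.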

Paper Structure (13 sections, 11 equations, 10 figures, 5 tables)

This paper contains 13 sections, 11 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: In the proposed ConceptHash, a set of concept tokens (3 tokens in this illustration) is introduced in a Vision Transformer to automatically discover human-understandable semantics (e.g., the first concept token attends to the bird head and generates the first two-bit sub-code 00). Further, within each sub-code, similar concepts (e.g., terns, warblers) are semantically grouped.
  • Figure 2: Overview of our ConceptHash model in a Vision Transformer (ViT) framework. To enable sub-code level interpretability, (i) we introduce a set of $M$ concept tokens along with the image patch tokens as the input. After self-attention based representation learning, (ii) each of these concept tokens is used to compute a sub-code, and the sub-codes are concatenated to form the entire hash code. (iii) To compensate for the limited information in visual observations, textual information from class names is further leveraged to learn more semantically meaningful hash class centers. For model training, a combination of classification loss $\mathcal{L}_\text{clf}$, quantization error $\mathcal{L}_\text{quan}$, concept spatial diversity constraint $\mathcal{L}_\text{csd}$, and concept discrimination constraint $\mathcal{L}_\text{cd}$ is applied concurrently. To increase training efficiency, adapters (Houlsby et al., 2019) are added to the ViT instead of fine-tuning all parameters.
  • Figure 3: We visualize the concepts discovered by our ConceptHash: (a, b, c) the bird body parts discovered on CUB-200-2011; (d, e, f) the car parts discovered on Stanford Cars. Setting: 6-bit hash codes with $M=3$ concepts, each producing a 2-bit sub-code. The bottom-left, top-left, top-right, and bottom-right regions represent the sub-codes 00, 01, 11, and 10, respectively.
  • Figure 4: The regions on which the hash function focuses when computing a hash code.
  • Figure 5: t-SNE visualization of the hash centers. The top 10 families of fine-grained classes in CUB-200-2011 are plotted for clarity.
  • ...and 5 more figures
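
The quadrant layout given in the Figure 3 caption (bottom-left 00, top-left 01, top-right 11, bottom-right 10) is consistent with thresholding two real-valued outputs at zero. The short sketch below illustrates that reading. It assumes the first bit follows the horizontal axis and the second bit the vertical axis, which is inferred from the caption's labelling rather than stated in the source; `quadrant_subcode` and the sample coordinates are hypothetical.

```python
def quadrant_subcode(x: float, y: float) -> str:
    """Map a 2-D point to a 2-bit sub-code by the sign of each coordinate.

    Assumed convention: first bit is 1 on the right half (x >= 0),
    second bit is 1 on the top half (y >= 0).
    """
    first = "1" if x >= 0 else "0"
    second = "1" if y >= 0 else "0"
    return first + second


# One representative point per region of the plot.
regions = {
    "bottom-left": (-1.0, -1.0),
    "top-left": (-1.0, 1.0),
    "top-right": (1.0, 1.0),
    "bottom-right": (1.0, -1.0),
}
layout = {name: quadrant_subcode(x, y) for name, (x, y) in regions.items()}

# Figure 3 setting: M = 3 concepts with 2 bits each.
num_concepts = 3
bits_per_concept = 2
total_bits = num_concepts * bits_per_concept   # length of the full hash code
values_per_concept = 2 ** bits_per_concept     # distinct sub-codes per concept
```

Under this convention each concept can take one of four sub-code values, and the three sub-codes together form the 6-bit code used in the figure.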