Language-driven Fine-grained Retrieval

Shijie Wang, Xin Yu, Yadan Luo, Zijian Wang, Pengfei Zhang, Zi Huang

TL;DR

Fine-grained image retrieval (FGIR) models trained with one-hot labels struggle to compare details across categories and to generalize to unseen categories. LaFG addresses this by converting category names into attribute-level supervision: an LLM generates attribute-oriented descriptions, a frozen VLM projects them into a vision-aligned space, and the result is distilled into a dataset-wide attribute vocabulary and per-class linguistic prototypes. The prototypes supervise a ViT-based retrieval model through distribution alignment between visual embeddings and linguistic attributes, aided by an auxiliary contrastive loss. Experiments on CUB-200-2011, Cars-196, and SOP report state-of-the-art Recall@1 and strong generalization to unseen categories, supporting language-driven supervision as a scalable approach to FGIR and cross-domain retrieval.
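
To make the data flow in this summary concrete, here is a minimal NumPy sketch of the three stages: clustering description embeddings into an attribute vocabulary, forming per-class prototypes, and aligning visual embeddings to them. It is not the authors' implementation. The spherical k-means step, the softmax temperature, the KL-divergence form of the alignment loss, and all function names are assumptions made for illustration. Random vectors stand in for the frozen VLM's text embeddings and the ViT's image embeddings, and the auxiliary contrastive loss is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products act as cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def build_vocabulary(desc_emb, num_attributes, iters=20, seed=0):
    """Cluster description embeddings into attribute centers.
    Assumption: spherical k-means stands in for the paper's clustering step."""
    rng = np.random.default_rng(seed)
    x = l2_normalize(desc_emb)
    centers = x[rng.choice(len(x), size=num_attributes, replace=False)]
    for _ in range(iters):
        assign = (x @ centers.T).argmax(axis=1)  # nearest center by cosine
        for k in range(num_attributes):
            members = x[assign == k]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[k] = members.mean(axis=0)
        centers = l2_normalize(centers)
    return centers

def class_prototypes(class_emb, vocab, temperature=0.1):
    """Per-class distribution over vocabulary attributes (hypothetical form)."""
    return softmax(l2_normalize(class_emb) @ vocab.T / temperature)

def alignment_loss(visual_emb, labels, vocab, prototypes, temperature=0.1):
    """Mean KL(prototype || predicted attribute distribution).
    Assumption: KL is one plausible choice of distribution alignment."""
    pred = softmax(l2_normalize(visual_emb) @ vocab.T / temperature)
    target = prototypes[labels]
    kl = (target * (np.log(target + 1e-12) - np.log(pred + 1e-12))).sum(axis=1)
    return kl.mean()

# Toy data: random stand-ins for VLM text embeddings and ViT image embeddings.
rng = np.random.default_rng(0)
desc_emb = rng.normal(size=(60, 16))   # embeddings of LLM-generated descriptions
class_emb = rng.normal(size=(5, 16))   # one text embedding per category
visual_emb = rng.normal(size=(8, 16))  # image embeddings from the retrieval model
labels = rng.integers(0, 5, size=8)

vocab = build_vocabulary(desc_emb, num_attributes=10)
protos = class_prototypes(class_emb, vocab)
loss = alignment_loss(visual_emb, labels, vocab, protos)
print(vocab.shape, protos.shape, float(loss))
```

In this sketch the loss falls to zero when an image embedding produces the same attribute distribution as its class prototype, which is the sense in which the prototypes act as supervision in place of one-hot labels.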

Abstract

Existing fine-grained image retrieval (FGIR) methods learn discriminative embeddings by adopting semantically sparse one-hot labels derived from category names as supervision. While effective on seen classes, such supervision overlooks the rich semantics encoded in category names, hindering the modeling of comparability among cross-category details and, in turn, limiting generalization to unseen categories. To tackle this, we introduce LaFG, a Language-driven framework for Fine-Grained Retrieval that converts class names into attribute-level supervision using large language models (LLMs) and vision-language models (VLMs). Treating each name as a semantic anchor, LaFG prompts an LLM to generate detailed, attribute-oriented descriptions. To mitigate attribute omission in these descriptions, it leverages a frozen VLM to project them into a vision-aligned space, clustering them into a dataset-wide attribute vocabulary while harvesting complementary attributes from related categories. Leveraging this vocabulary, a global prompt template selects category-relevant attributes, which are aggregated into category-specific linguistic prototypes. These prototypes supervise the retrieval model to steer

Paper Structure

This paper contains 13 sections, 10 equations, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Motivation of LaFG. (a) Learning with one-hot labels compresses each class name into a single global identifier and overlooks parts and attributes, making it hard to compare appearance details when facing unseen categories. As a result, similar local regions become indistinguishable, which degrades generalization to unseen categories. (b) Language-driven learning turns category names into linguistic supervision, thus establishing detail comparability. The model acquires transferable discriminative knowledge and improves retrieval on unseen categories.
  • Figure 2: Framework illustration of Language-driven Fine-grained Generalization. See the LaFG section for more details.
  • Figure 3: Visualization of clustered attribute responses for the same subcategory (e.g., Blue Jay). The Top-5 attributes are selected from the vocabulary based on their similarity scores. (a) Input images; (b)–(f) Attribute response regions for comparison.
  • Figure 4: Illustration of class activation maps generated by the baseline and our LaFG. (a) and (d) show the input images; (b) and (e) present the corresponding class activation maps produced by the baseline; (c) and (f) display the maps generated by our LaFG.