Super-class guided Transformer for Zero-Shot Attribute Classification

Sehyung Kim, Chanhyeong Yang, Jihwan Park, Taehoon Song, Hyunwoo J. Kim

TL;DR

SugaFormer tackles zero-shot region-level attribute classification by introducing super-class guided queries to reduce query complexity and by employing multi-context decoding to capture diverse visual cues. It couples two knowledge-transfer mechanisms with frozen Vision-Language Models: SCR aligns image features with super-class prompts during training, and ZRSE refines unseen-attribute predictions at inference by leveraging retrieval-like text–image similarities. Empirical results on VAW, LSA, and OVAD demonstrate state-of-the-art zero-shot and cross-dataset performance, with ablations confirming the additive benefits of SQI, MD, SCR, and ZRSE. The approach enhances scalability and generalizability, offering a practical framework for open-vocabulary attribute classification in real-world applications.

Abstract

Attribute classification is crucial for identifying specific characteristics within image regions. Vision-Language Models (VLMs) have been effective in zero-shot tasks by leveraging their general knowledge from large-scale datasets. Recent studies demonstrate that transformer-based models with class-wise queries can effectively address zero-shot multi-label classification. However, poor utilization of the relationship between seen and unseen attributes limits the model's generalizability. Additionally, attribute classification generally involves many attributes, which makes it difficult to maintain the model's scalability. To address these issues, we propose Super-class guided transFormer (SugaFormer), a novel framework that leverages super-classes to enhance scalability and generalizability for zero-shot attribute classification. SugaFormer employs Super-class Query Initialization (SQI) to reduce the number of queries, utilizing common semantic information from super-classes, and incorporates Multi-context Decoding (MD) to handle diverse visual cues. To strengthen generalizability, we introduce two knowledge transfer strategies that utilize VLMs. During training, Super-class guided Consistency Regularization (SCR) aligns the model's features with VLMs using super-class guided prompts, and during inference, Zero-shot Retrieval-based Score Enhancement (ZRSE) refines predictions for unseen attributes. Extensive experiments demonstrate that SugaFormer achieves state-of-the-art performance across three widely used attribute classification benchmarks under zero-shot and cross-dataset transfer settings. Our code is available at https://github.com/mlvlab/SugaFormer.

Paper Structure

This paper contains 20 sections, 16 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: Hierarchical structure of attributes and effectiveness of super-class guided prompt. (A) Attributes belonging to the same super-class share common semantic information. (B) By leveraging super-class guided prompts, VLMs make more accurate predictions by distinguishing attributes within each super-class.
  • Figure 2: Model architecture. The overall pipeline of SugaFormer extracts multi-context visual features $\mathbf{f}_\mathbf{x}$ from an image, a cropped image, and a masked image. The super-class query $\mathbf{q}_{j}$ and the pooled visual feature $\textbf{v}$ are concatenated. The concatenated query is passed through different projection layers $\text{Proj}^{\text{x}}(\cdot)$, generating super-class queries $\mathbf{\tilde{q}}_j^\mathbf{x}$. Each $\text{Decoder}_{\textbf{x}}(\cdot)$ processes its respective $\mathbf{\tilde{q}}_j^\mathbf{x}$ with the corresponding visual feature map $\mathbf{f}_\mathbf{x}$, producing output queries $\mathbf{\hat{q}}^{\textbf{x}}_j$. Logit scores $\hat{c}_i^\mathbf{x}$ are computed via the inner product between $\mathbf{t}_i$ and $\mathbf{\hat{q}}^{\textbf{x}}_j$. The averaged logit score $\bar{c}_i$ is used to calculate the prediction score $\hat{p}^i$ for the $i$-th attribute. To enhance generalizability, super-class guided consistency regularization is applied during training, and zero-shot retrieval-based score enhancement is used during inference. Best viewed in color.
  • Figure 3: Illustration of knowledge transfer strategies. (A) During training, the Q-Former extracts a [MASK] token feature $\mathbf{\hat{p}}_j^l$ using a super-class guided prompt $\mathbf{p}_j$. This process leverages learned queries $\mathbf{z}$ and a tokenized prompt obtained from the prompt $\mathbf{p}_j$, which integrates the $j$-th super-class and the object class name. We compute $\mathcal{L}_{\text{SCR}}$ as the L1 distance between the mean of the output features $\mathbf{\bar{q}}_{j}$ from multi-context decoding and the [MASK] token feature $\mathbf{\hat{p}}_j^l$. (B) During inference, we employ ZRSE, in which the maximum similarity $\hat{r}_i$ obtained from the Q-Former is used to compensate for novel classes. Note that we use the same frozen Q-Former.
  • Figure 4: Visualization of attribute text embeddings. The UMAP visualization shows class text embeddings. The classes belonging to the same super-class are visualized with the same color and naturally form distinct, well-defined clusters. In addition, novel classes fall into the corresponding clusters. This explains why the similarity-based mapping function between classes and super-classes is effective.
  • Figure 5: Similarity-based super-class mapping.
  • ...and 3 more figures
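
The captions for Figures 2 to 5 describe three small computations: mapping an attribute to a super-class by text-embedding similarity, scoring an attribute from the averaged multi-context logits, and the L1 consistency term $\mathcal{L}_{\text{SCR}}$. The sketch below illustrates them on toy vectors and is not the authors' implementation. All function names are hypothetical. It assumes cosine similarity for the mapping, a sigmoid to turn $\bar{c}_i$ into $\hat{p}^i$, and a summed (not averaged) L1 reduction; none of these choices is stated in the text above.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def map_to_super_class(attr_emb, super_embs):
    """Index j of the most similar super-class text embedding.
    Cosine similarity is an assumption made for this sketch."""
    sims = [cosine(attr_emb, s) for s in super_embs]
    return max(range(len(sims)), key=sims.__getitem__)

def prediction_score(t_i, q_hat_by_context):
    """Score for attribute i given the output queries of its super-class,
    one query per context x (image, cropped image, masked image)."""
    logits = [dot(t_i, q) for q in q_hat_by_context]  # c_hat_i^x per context
    c_bar = sum(logits) / len(logits)                 # averaged logit c_bar_i
    return 1.0 / (1.0 + math.exp(-c_bar))             # sigmoid: an assumption

def scr_loss(q_hat_by_context, mask_feature):
    """L1 distance between the mean output query q_bar_j and the [MASK]
    token feature. Summing over dimensions is an assumption."""
    n = len(q_hat_by_context)
    q_bar = [sum(q[d] for q in q_hat_by_context) / n
             for d in range(len(mask_feature))]
    return sum(abs(a - b) for a, b in zip(q_bar, mask_feature))

# Toy 2-D example with two hypothetical super-classes.
super_embs = [[1.0, 0.0], [0.0, 1.0]]
t_i = [0.9, 0.1]                                  # attribute text embedding
j = map_to_super_class(t_i, super_embs)
q_hat = [[1.0, 0.0], [0.0, 0.0], [2.0, 0.0]]      # one output query per context
p = prediction_score(t_i, q_hat)
loss = scr_loss(q_hat, [1.0, 0.5])
```

In this toy run the attribute maps to the first super-class, and the three context logits are averaged before a single score is produced, so each attribute gets one prediction regardless of how many contexts are decoded.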