Tree of Attributes Prompt Learning for Vision-Language Models

Tong Ding, Wanhua Li, Zhongqi Miao, Hanspeter Pfister

TL;DR

This work tackles the limitation of unstructured, generic prompts in vision-language models by introducing TAP, which distills structured knowledge from LLMs into a Tree of Attributes (ToA) for each class. It couples top-down ToA generation with bottom-up attribute-level aggregation via vision-conditional pooling and learnable domain-expert tokens in both vision and text streams, anchored to a CLIP backbone. Across 11 datasets and multiple evaluation regimes (base-to-novel, cross-dataset, few-shot), TAP achieves state-of-the-art performance, highlighting the benefits of structured attribute hierarchies and instance-specific description pooling for robust image-text alignment. The approach also provides interpretable attribution through attribute-focused attention and Grad-CAM visualizations, though its reliance on LLM prompts may pose challenges for highly fine-grained distinctions, suggesting avenues for future improvement in LLM robustness or alternative knowledge sources.

Abstract

Prompt learning has proven effective in adapting vision-language models for downstream tasks. However, existing methods usually append learnable prompt tokens solely to the category names to obtain textual features, which fails to fully leverage the rich context indicated in the category name. To address this issue, we propose Tree of Attributes Prompt learning (TAP), which first instructs LLMs to generate a tree of attributes with a "concept - attribute - description" structure for each category, and then learns the hierarchy with vision and text prompt tokens. Unlike existing methods that merely augment category names with a set of unstructured descriptions, our approach essentially distills structured knowledge graphs associated with class names from LLMs. Furthermore, our approach introduces text and vision prompts designed to explicitly learn the corresponding visual attributes, effectively serving as domain experts. Additionally, the general and diverse descriptions generated from the class names may be wrong for, or absent from, the specific image at hand. To address this misalignment, we further introduce a vision-conditional pooling module to extract instance-specific text features. Extensive experimental results demonstrate that our approach outperforms state-of-the-art methods on zero-shot base-to-novel generalization, cross-dataset transfer, and few-shot classification across 11 diverse datasets. Code is available at https://github.com/HHenryD/TAP.
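
To make the "concept - attribute - description" structure concrete, here is a minimal sketch of how one tree might be held in memory and flattened into text prompts. The type alias `TreeOfAttributes`, the helper `flatten_tree`, the prompt template, and every description string are assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import Dict, List, Tuple

# class name (concept) -> attribute -> list of descriptions
TreeOfAttributes = Dict[str, Dict[str, List[str]]]

# Illustrative placeholder content, NOT output from the paper's LLM prompts.
toa: TreeOfAttributes = {
    "dumplings": {
        "color": ["pale white wrapper", "golden-brown pan-fried base"],
        "shape": ["crescent with pleated edge", "round pouch gathered at the top"],
    }
}


def flatten_tree(tree: TreeOfAttributes) -> List[Tuple[str, str, str]]:
    """Return (class, attribute, prompt) triples, one per leaf description."""
    triples = []
    for cls, attributes in tree.items():
        for attr, descriptions in attributes.items():
            for desc in descriptions:
                # Hypothetical template that extends "a photo of a {class}".
                triples.append((cls, attr, f"a photo of a {cls}, {desc}"))
    return triples


prompts = flatten_tree(toa)
for cls, attr, prompt in prompts:
    print(f"[{cls} / {attr}] {prompt}")
```

Keeping the attribute as an explicit key, rather than merging all descriptions into one flat list, is what lets a later stage treat each attribute separately.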

Paper Structure

This paper contains 22 sections, 12 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Illustration of the methods for forming CLIP text prompts. (a) Manually created prompt with the single "a photo of a {class}" template; (b) An unstructured set of detailed descriptions generated by LLMs; (c) The proposed Tree of Attributes distills a knowledge graph from LLMs, organizing the knowledge in a "concept - attribute - descriptions" structure; (d) An example Tree of Attributes for class "dumplings", where each color represents a visual attribute.
  • Figure 2: Overview of the proposed TAP method. TAP uses a bottom-up approach to aggregate the generated Tree of Attributes. The vision-conditional pooling (VCP) layer aggregates descriptions into attribute-level features, which are aligned with visual expert tokens focusing on specific attributes (e.g., color, texture). These attribute-level features are then combined to make class predictions via a weighted sum of logits from each attribute, fully leveraging the hierarchical structure within the tree.
  • Figure 3: Visualization of the class activation maps.
  • Figure 4: Visualization of the attention weights in the VCP layer for an example "dumplings" image.
  • Figure 6: Effects of $\alpha$
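
The bottom-up aggregation described in the Figure 2 caption can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the authors' implementation: the attention here is an unparameterized scaled dot product (any learned projections are omitted), the per-attribute logit is a plain cosine similarity, and the names `vision_conditional_pool`, `class_logit`, and `attr_weights`, as well as the toy two-dimensional vectors, are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vision_conditional_pool(visual_token, description_feats):
    """Pool description features using weights conditioned on the visual token."""
    scale = math.sqrt(len(visual_token))
    weights = softmax([dot(visual_token, d) / scale for d in description_feats])
    pooled = [
        sum(w * d[i] for w, d in zip(weights, description_feats))
        for i in range(len(visual_token))
    ]
    return pooled, weights


def class_logit(visual_tokens, tree_feats, attr_weights):
    """Weighted sum over attributes of the image-text similarity for one class."""
    logit = 0.0
    for attr, description_feats in tree_feats.items():
        pooled, _ = vision_conditional_pool(visual_tokens[attr], description_feats)
        logit += attr_weights[attr] * cosine(visual_tokens[attr], pooled)
    return logit


# Toy features: one visual expert token and two description vectors per attribute.
visual_tokens = {"color": [1.0, 0.0], "shape": [0.0, 1.0]}
tree_feats = {
    "color": [[1.0, 0.0], [0.0, 1.0]],
    "shape": [[0.0, 1.0], [1.0, 0.0]],
}
attr_weights = {"color": 0.5, "shape": 0.5}  # hypothetical equal weighting

pooled, weights = vision_conditional_pool(visual_tokens["color"], tree_feats["color"])
logit = class_logit(visual_tokens, tree_feats, attr_weights)
print(weights, logit)
```

In this toy setup, the description whose feature matches the visual token receives the larger pooling weight, which is the instance-specific behavior the VCP layer is meant to provide: descriptions that do not apply to the given image are down-weighted before the attribute-level comparison.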