HPT++: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling

Yubin Wang; Xinyang Jiang; De Cheng; Wenli Sun; Dongsheng Li; Cairong Zhao

TL;DR

This paper proposes a novel approach called Hierarchical Prompt Tuning (HPT), enabling simultaneous modeling of both structured and conventional linguistic knowledge, and introduces a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning.

Abstract

Prompt learning has become a prevalent strategy for adapting vision-language foundation models (VLMs) such as CLIP to downstream tasks. With the emergence of large language models (LLMs), recent studies have explored the potential of using category-related descriptions to enhance prompt effectiveness. However, conventional descriptions lack the explicit structured information necessary to represent the interconnections among key elements, such as entities or attributes, in relation to a particular category. Since existing prompt tuning methods give little consideration to managing structured knowledge, this paper advocates leveraging LLMs to construct a graph for each description to prioritize such structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), enabling simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts that model overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Finally, by enhancing multi-granularity knowledge generation, redesigning the relationship-driven attention re-weighting module, and incorporating consistent constraints on the hierarchical text encoder, we propose HPT++, which further improves the performance of HPT. Our experiments are conducted across a wide range of evaluation settings, including base-to-new generalization, cross-dataset evaluation, and domain generalization. Extensive results and ablation studies demonstrate the effectiveness of our methods, which consistently outperform existing SOTA methods.
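To make the idea of relationship-guided attention more concrete, the following is a minimal sketch of one plausible form: ordinary scaled dot-product attention in which an additive bias raises the score of token pairs that are linked in the description graph. The abstract does not specify the module at this level of detail, so the function name `relation_biased_attention`, the additive-bias formulation, and the toy inputs are all assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def relation_biased_attention(queries, keys, values, relation_bias):
    """Scaled dot-product attention with an additive per-pair bias.

    relation_bias[i][j] is added to the logit for token i attending to
    token j. A value of 0.0 means no relationship is recorded for the pair.
    This additive form is an assumption, not taken from the paper.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d) + relation_bias[i][j]
            for j, k in enumerate(keys)
        ]
        w = softmax(logits)
        weights.append(w)
        outputs.append(
            [sum(w[j] * values[j][c] for j in range(len(values)))
             for c in range(len(values[0]))]
        )
    return outputs, weights


# Hypothetical tokens: 0 = "bird" (entity), 1 = "wing" (attribute), 2 = "sky".
# Identical query/key vectors give uniform attention when no bias is applied.
q = k = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [[0.0, 1.0, 0.0],   # the graph links "bird" to "wing"
        [1.0, 0.0, 0.0],   # and "wing" back to "bird"
        [0.0, 0.0, 0.0]]   # "sky" has no recorded relationship
outputs, weights = relation_biased_attention(q, k, v, bias)
```

In this toy setup the unlinked token attends uniformly, while the linked pair receives extra weight. In a trained model the bias would presumably be learned per relationship type rather than fixed by hand.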

Paper Structure (25 sections, 15 equations, 9 figures, 5 tables)

Introduction
Related Work
Large Language Models
Visual-Language Models
Prompt Learning for Vision-Language Models
HPT
Overall Pipeline
Linguistic Data Generation
Hierarchical Prompt Tuning
Relationship-guided Attention Module
HPT++
Overall Improvements
Multi-Granularity Knowledge Generation
Relationship-Driven Attention Re-Weighting Module
Consistent Constraint on Hierarchical Prompted Text Encoder
...and 10 more sections

Figures (9)

Figure 1: We input a few hand-written instructions into LLMs to generate human-like category-related descriptions along with structured graphs based on each description.
Figure 2: Our HPT applies a dual-path asymmetric network as the framework. Descriptions and relationship-guided graphs with class names are used as input for the frozen text encoder and the hierarchical prompted text encoder respectively. In the hierarchical prompted text encoder, we apply three types of prompts, low-level prompts, high-level prompts, and global-level prompts for hierarchical tuning, and design a relationship-guided attention module for modeling structured knowledge.
Figure 3: Illustration of multi-granularity knowledge generation. We first compute the similarity between coarse-grained descriptions of different categories, and then generate fine-grained descriptions for each category based on its closest categories. We integrate descriptions of both granularities to produce an overall description with multi-granularity semantics, which is subsequently used for generating structured graphs.
Figure 4: Comparison with existing methods on base-to-new generalization. B: Base Classes. N: New Classes. HM: Harmonic mean. HPT and HPT++ demonstrate strong generalization performance on 11 image recognition datasets.
Figure 5: Comparison with existing methods on cross-dataset evaluation. The best results are highlighted in bold while the second best results are marked with an underline. HPT and HPT++ achieve competitive performance providing the highest average accuracy, indicating superior generalization abilities on other datasets.
...and 4 more figures
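
The selection step described in the Figure 3 caption (compare coarse-grained descriptions across categories, then pick each category's closest neighbours as context for fine-grained generation) can be sketched as follows. This is only an illustration under stated assumptions: the caption does not say how similarity is computed, so cosine similarity over hypothetical description embeddings is assumed here, and the toy vectors, the helper names `closest_categories` and `fine_grained_prompt`, and the prompt wording are all invented for the example.

```python
import math


def cosine(u, v):
    # Cosine similarity between two non-zero vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_categories(embeddings, k):
    """Map each category to its k most similar other categories.

    embeddings: dict of category name -> embedding of its coarse-grained
    description (assumed to come from some text encoder).
    """
    result = {}
    for name, vec in embeddings.items():
        scored = [
            (cosine(vec, other_vec), other)
            for other, other_vec in embeddings.items()
            if other != name
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        result[name] = [other for _, other in scored[:k]]
    return result


def fine_grained_prompt(category, neighbours):
    # Hypothetical LLM instruction; the paper's actual wording is not shown here.
    return f"What distinguishes a {category} from " + ", ".join(neighbours) + "?"


# Toy embeddings standing in for encoded coarse-grained descriptions.
toy = {
    "husky": [0.9, 0.1, 0.0],
    "wolf": [0.8, 0.2, 0.1],
    "tabby cat": [0.1, 0.9, 0.2],
}
neighbours = closest_categories(toy, k=1)
prompt = fine_grained_prompt("husky", neighbours["husky"])
```

The intuition is that asking an LLM to contrast a category with its nearest neighbours should yield descriptions that focus on discriminative details, which is what the fine-grained granularity is meant to capture.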
