HGCLIP: Exploring Vision-Language Models with Graph Representations for Hierarchical Understanding

Peng Xia; Xingtong Yu; Ming Hu; Lie Ju; Zhiyong Wang; Peibo Duan; Zongyuan Ge

TL;DR

HGCLIP tackles hierarchical image classification by marrying Vision-Language Models with graph-based hierarchies. It introduces learnable multi-modal prompts and dual graph encoders to propagate hierarchical structure into textual and visual representations, while using prototype-based visual terms and attention to align patch-level features with class prototypes. The method achieves state-of-the-art performance across 11 hierarchical benchmarks and demonstrates robustness to noisy hierarchy inputs from large language models and to distribution shifts. This work highlights a scalable, trainable approach to multi-granularity understanding that can adapt to datasets with or without predefined hierarchies, advancing practical hierarchical recognition in vision-language systems.

Abstract

Object categories are typically organized into a multi-granularity taxonomic hierarchy. When classifying categories at different hierarchy levels, traditional uni-modal approaches focus primarily on image features, which limits them in complex scenarios. Recent studies integrating Vision-Language Models (VLMs) with class hierarchies have shown promise, yet they fall short of fully exploiting the hierarchical relationships. In particular, they do not perform effectively across categories of varying granularity. To tackle this issue, we propose a novel framework, HGCLIP, that combines CLIP with a deeper exploitation of the Hierarchical class structure via Graph representation learning. We construct the class hierarchy as a graph whose nodes represent the textual or image features of each category. After passing through a graph encoder, the textual features incorporate hierarchical structure information, while the image features emphasize class-aware features derived from prototypes through the attention mechanism. Our approach demonstrates significant improvements on 11 diverse visual recognition benchmarks. Our code is available at https://github.com/richard-peng-xia/HGCLIP.
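To make the graph idea in the abstract concrete, the following is a minimal sketch of a single message-passing step over a class hierarchy. It is not the paper's graph encoder: it uses plain, unweighted mean aggregation with no learned parameters, and the toy hierarchy, the two-dimensional features, and the names `build_adjacency` and `message_passing_step` are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Vector = List[float]


def build_adjacency(edges: List[Tuple[int, int]], num_nodes: int) -> Dict[int, Set[int]]:
    """Build undirected neighbor sets with self-loops from (parent, child) edges."""
    neighbors: Dict[int, Set[int]] = {i: {i} for i in range(num_nodes)}
    for parent, child in edges:
        neighbors[parent].add(child)
        neighbors[child].add(parent)
    return neighbors


def message_passing_step(features: List[Vector], neighbors: Dict[int, Set[int]]) -> List[Vector]:
    """Replace each node feature with the mean over its neighborhood (itself included).

    Assumption: a real graph encoder would also apply learned weights and a
    non-linearity; both are omitted here to keep the aggregation visible.
    """
    updated: List[Vector] = []
    for node in range(len(features)):
        group = [features[j] for j in sorted(neighbors[node])]
        dim = len(features[node])
        updated.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return updated


# Hypothetical toy hierarchy: node 0 = "animal", node 1 = "dog", node 2 = "cat".
edges = [(0, 1), (0, 2)]
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

neighbors = build_adjacency(edges, num_nodes=3)
fused = message_passing_step(features, neighbors)
```

After one step, the coarse "animal" node carries a blend of both children, and each fine-grained node is pulled toward its parent, which is the sense in which class features become fused with hierarchical information.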

Paper Structure (32 sections, 13 equations, 11 figures, 11 tables)

This paper contains 32 sections, 13 equations, 11 figures, 11 tables.

Introduction
Related Work
Preliminaries
Revisiting CLIP
Graph Encoder
Methodology
Hierarchy Setting
Multi-modal Hierarchical Prompt
Delving into Graph Representations
Experiment
Benchmark Setting
Hierarchical Image Classification
Ablative Analysis
Qualitative Analysis
Graph Encoder Analysis
...and 17 more sections

Figures (11)

Figure 1: An illustration of the graph representation based on class hierarchy. (a) The class hierarchy is presented in a tree structure. (b) The hierarchical labels are constructed into a graph, with nodes representing the text/image features of each class. The graph is fed into a graph encoder, where the nodes update the parameters by aggregating the messages from their neighboring nodes. Thus, the class features are fused with hierarchical information via graph representation learning.
Figure 2: t-SNE plots of image embeddings from the SOTA methods CoCoOp and MaPLe and from HGCLIP on two datasets with distinct semantic granularities. HGCLIP shows better separability at both the fine-grained and coarse-grained levels.
Figure 3: The pipeline of HGCLIP for adapting CLIP to hierarchical image classification. We introduce multi-modal hierarchical prompts to learn contextual representations. We then construct the label hierarchy as a graph whose nodes represent the textual or image features of each class. Features integrate hierarchical structure information through message passing in the graph encoder. Textual features directly combine hierarchical representations, while image features focus on class-aware prototypes through the attention mechanism.
Figure 4: Semantic prototypes are constructed to guide the learning of hierarchical semantics of images.
Figure 5: Example decisions from our model, MaPLe and CLIP.
...and 6 more figures
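
The captions for Figures 3 and 4 describe image features attending to class-aware prototypes. As a rough sketch of that idea, the snippet below runs standard scaled dot-product attention with one image feature as the query and the class prototypes as both keys and values. The prototype values, the feature dimension, and the function name `attend_to_prototypes` are assumptions for illustration; the paper's actual attention module, including any projections or residual connections, may differ.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_to_prototypes(image_feature: Vector, prototypes: List[Vector]) -> Tuple[List[float], Vector]:
    """Scaled dot-product attention: query = image feature, keys = values = prototypes.

    Returns the attention weights (one per prototype) and the weighted
    combination of prototypes, i.e. a class-aware version of the feature.
    """
    scale = math.sqrt(len(image_feature))
    scores = [
        sum(q * k for q, k in zip(image_feature, proto)) / scale
        for proto in prototypes
    ]
    weights = softmax(scores)
    dim = len(image_feature)
    attended = [
        sum(w * proto[d] for w, proto in zip(weights, prototypes))
        for d in range(dim)
    ]
    return weights, attended


# Hypothetical prototypes for two classes and one image feature.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
image_feature = [2.0, 0.0]

weights, attended = attend_to_prototypes(image_feature, prototypes)
```

Because the image feature points in the same direction as the first prototype, that prototype receives the larger attention weight, so the attended feature is dominated by the matching class.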
