H3Former: Hypergraph-based Semantic-Aware Aggregation via Hyperbolic Hierarchical Contrastive Loss for Fine-Grained Visual Classification

Yongji Zhang, Siqi Li, Kuiyang Huang, Yue Gao, Yu Jiang

TL;DR

Fine-grained visual classification (FGVC) suffers from subtle inter-class differences and large intra-class variation. The authors introduce H3Former, a token-to-region framework with two components: a Semantic-Aware Aggregation Module (SAAM), which builds a dynamic hypergraph linking tokens via learnable semantic prototypes, and a Hyperbolic Hierarchical Contrastive Loss (HHCL), which enforces hierarchical semantics in both Euclidean and hyperbolic spaces. SAAM aggregates token features into coherent region representations through hypergraph convolution, while HHCL guides representation learning along a semantic hierarchy using Lorentz-model hyperbolic geometry and a dual contrastive objective. With a Swin Transformer backbone and a Context Generation Module, H3Former achieves state-of-the-art results on four FGVC benchmarks, demonstrating strong discriminability, interpretable region semantics, and robust generalization; notably, Flowers-101 reaches 99.7% accuracy. The approach combines high-order semantic aggregation and geometry-aware supervision to bridge local cues and global structure, offering a practical mechanism for fine-grained recognition in cluttered, multi-instance scenes.
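
To make the Lorentz-model geometry mentioned above concrete, here is a minimal sketch of the geodesic distance on the hyperboloid, which is the kind of quantity a hyperbolic contrastive term would compare. It is not the authors' implementation: the function names, the curvature parameter `c`, and the choice to lift Euclidean vectors by solving for the time coordinate (rather than, say, an exponential map) are all assumptions made for illustration.

```python
import math

def lorentz_inner(x, y):
    # Minkowski (Lorentzian) inner product: -x0*y0 + sum_i xi*yi.
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift_to_hyperboloid(v, c=1.0):
    # Assumed lifting: keep v as the spatial part and solve for the time
    # coordinate so that <x, x>_L = -1/c holds exactly.
    x0 = math.sqrt(1.0 / c + sum(a * a for a in v))
    return [x0] + list(v)

def lorentz_distance(x, y, c=1.0):
    # Geodesic distance: d(x, y) = arccosh(-c * <x, y>_L) / sqrt(c).
    # The clamp guards against arguments slightly below 1 from rounding.
    arg = max(1.0, -c * lorentz_inner(x, y))
    return math.acosh(arg) / math.sqrt(c)

# Hypothetical region embeddings (illustrative values only).
anchor = lift_to_hyperboloid([0.3, -0.2])
positive = lift_to_hyperboloid([0.35, -0.15])
negative = lift_to_hyperboloid([-1.5, 2.0])
d_pos = lorentz_distance(anchor, positive)
d_neg = lorentz_distance(anchor, negative)
```

In a contrastive objective, negated distances such as `-d_pos` and `-d_neg` could serve as similarity logits. Because distances grow quickly away from the origin of the hyperboloid, this geometry is often considered a natural fit for tree-like hierarchies.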

Abstract

Fine-Grained Visual Classification (FGVC) remains a challenging task due to subtle inter-class differences and large intra-class variations. Existing approaches typically rely on feature-selection mechanisms or region-proposal strategies to localize discriminative regions for semantic analysis. However, these methods often fail to capture discriminative cues comprehensively while introducing substantial category-agnostic redundancy. To address these limitations, we propose H3Former, a novel token-to-region framework that leverages high-order semantic relations to aggregate local fine-grained representations with structured region-level modeling. Specifically, we propose the Semantic-Aware Aggregation Module (SAAM), which exploits multi-scale contextual cues to dynamically construct a weighted hypergraph among tokens. By applying hypergraph convolution, SAAM captures high-order semantic dependencies and progressively aggregates token features into compact region-level representations. Furthermore, we introduce the Hyperbolic Hierarchical Contrastive Loss (HHCL), which enforces hierarchical semantic constraints in a non-Euclidean embedding space. The HHCL enhances inter-class separability and intra-class consistency while preserving the intrinsic hierarchical relationships among fine-grained categories. Comprehensive experiments conducted on four standard FGVC benchmarks validate the superiority of our H3Former framework.
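
The token-to-region aggregation described in the abstract can be illustrated with a small sketch of soft hypergraph message passing. This is a simplified illustration under stated assumptions, not the paper's SAAM: memberships here come from plain dot-product similarity to hypothetical prototype vectors, there are no learnable projections or multi-scale context vectors, and all function names are invented for the example.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_incidence(tokens, prototypes):
    # H[i][k] is the soft membership of token i in hyperedge k.
    # Each row sums to 1 (assumed normalization, one hyperedge per prototype).
    return [softmax([dot(t, p) for p in prototypes]) for t in tokens]

def tokens_to_regions(tokens, H):
    # Each hyperedge (region) feature is the membership-weighted mean of tokens.
    dim = len(tokens[0])
    regions = []
    for k in range(len(H[0])):
        weight = sum(row[k] for row in H)  # always positive after softmax
        regions.append([
            sum(H[i][k] * tokens[i][d] for i in range(len(tokens))) / weight
            for d in range(dim)
        ])
    return regions

def regions_to_tokens(regions, H):
    # Each token is updated as a membership-weighted mix of region features.
    dim = len(regions[0])
    return [
        [sum(row[k] * regions[k][d] for k in range(len(regions))) for d in range(dim)]
        for row in H
    ]

# Four toy 2-D tokens and two hypothetical prototypes.
tokens = [[4.0, 0.0], [5.0, 0.0], [0.0, 4.0], [0.0, 5.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
H = soft_incidence(tokens, prototypes)
regions = tokens_to_regions(tokens, H)
updated = regions_to_tokens(regions, H)
```

Because a hyperedge can connect any number of tokens at once, each region feature pools evidence from a whole group rather than from pairwise links, which is the sense in which the relations are high-order.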

Paper Structure

This paper contains 16 sections, 16 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Hyperedges ($\mathcal{E}^1$–$\mathcal{E}^4$) of hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ generated by our $\text{H}^3$Former. Distinct hyperedges correspond to meaningful semantic regions, e.g., tail feathers, wing, beak, and eye. The learned hypergraphs automatically highlight key discriminative parts without any part-level supervision. $\text{H}^3$Former adaptively constructs coherent semantic regions through its hypergraph construction mechanism, bridging local token cues and global structural representation for FGVC.
  • Figure 2: Illustration of different FGVC paradigms. (a) Feature-selection based methods perform token filtering in the feature space to retain features most relevant to fine-grained recognition, but overlook coherent semantic structure. (b) Region-relation based methods learn pairwise dependencies among predefined regions, typically obtained from RPNs, which may introduce redundant and category-agnostic information. (c) Our proposed $\text{H}^3$Former organizes discrete tokens into structured semantic regions via a hypergraph formulation, where each hyperedge adaptively aggregates related tokens. Furthermore, the proposed HHCL imposes hierarchical constraints to enhance the discriminability and consistency of these regions.
  • Figure 3: Overview of the proposed $\text{H}^3$Former framework. The Semantic-Aware Aggregation Module (SAAM) constructs a weighted hypergraph to capture high-order semantic relations and progressively aggregates tokens into semantically coherent regions. Meanwhile, the Hyperbolic Hierarchical Contrastive Loss (HHCL) operates on the resulting hierarchical region representations to enforce fine-grained category separation and structural consistency in two spaces, yielding more discriminative representations.
  • Figure 4: The architecture of the Context Generation Module (CGM). The CGM utilizes the token features and attention maps from each stage to generate corresponding context vectors that encode multi-scale contextual information. When window-based attention is used, the attention maps are processed along the dashed path to produce the importance vector, which reflects the relative significance of tokens within each window.
  • Figure 5: Illustration of hierarchical hypergraph modeling and the HHCL loss. (a) HHCL consists of $\mathcal{L}_{hpop}$ for hierarchical consistency, $\mathcal{L}_{hcon}$ for hyperbolic contrastive learning, and $\mathcal{L}_{econ}$ for Euclidean discrimination. (b) Region-level features are hierarchically merged based on semantic similarity. (c) SAAM performs soft hypergraph message passing from tokens to regions and back.
  • ...and 4 more figures