H3Former: Hypergraph-based Semantic-Aware Aggregation via Hyperbolic Hierarchical Contrastive Loss for Fine-Grained Visual Classification
Yongji Zhang, Siqi Li, Kuiyang Huang, Yue Gao, Yu Jiang
TL;DR
Fine-grained visual classification (FGVC) suffers from subtle inter-class differences and large intra-class variation. The authors introduce H3Former, a token-to-region framework that combines two components: a Semantic-Aware Aggregation Module (SAAM), which builds a dynamic hypergraph linking tokens via learnable semantic prototypes, and a Hyperbolic Hierarchical Contrastive Loss (HHCL), which enforces hierarchical semantics in both Euclidean and hyperbolic spaces. SAAM aggregates token features into coherent region representations through hypergraph convolution, while HHCL guides representation learning along a semantic hierarchy using Lorentz-model hyperbolic geometry and a dual contrastive objective. With a Swin Transformer backbone and a Context Generation Module, H3Former achieves state-of-the-art results on four FGVC benchmarks, demonstrating strong discriminability, interpretable region semantics, and robust generalization; notably, Flowers-101 reaches 99.7% accuracy. By combining high-order semantic aggregation with geometry-aware supervision, the approach bridges local cues and global structure, offering a practical mechanism for fine-grained recognition in cluttered, multi-instance scenes.
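To make the Lorentz-model geometry mentioned above concrete, the following is a minimal sketch of how a geodesic distance on the hyperboloid could be computed. It is not the authors' HHCL implementation: the function names, the curvature parameter `c`, and the choice to lift Euclidean features with the exponential map at the origin are all assumptions made for illustration.

```python
import math

def lift_to_hyperboloid(v, c=1.0):
    """Map a Euclidean vector onto the hyperboloid of curvature -c.

    Assumption: the lift is the exponential map at the origin; the
    summary does not state how features are projected.
    """
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        # The zero vector maps to the hyperboloid origin.
        return [1.0 / sqrt_c] + [0.0] * len(v)
    time = math.cosh(sqrt_c * norm) / sqrt_c
    scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)
    return [time] + [scale * x for x in v]

def lorentz_inner(x, y):
    """Lorentzian inner product: negative time term plus spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lorentz_distance(x, y, c=1.0):
    """Geodesic distance between two points on the hyperboloid."""
    # Clamp to 1.0 so rounding error cannot push acosh outside its domain.
    arg = max(1.0, -c * lorentz_inner(x, y))
    return math.acosh(arg) / math.sqrt(c)

# Hypothetical 2-D embeddings, chosen only to exercise the functions.
a = lift_to_hyperboloid([0.3, 0.4])
b = lift_to_hyperboloid([-0.2, 0.1])
print(lorentz_distance(a, b))
```

A contrastive objective in hyperbolic space would typically use the negative of such a distance as a similarity score; how HHCL combines it with the Euclidean term is not specified in this summary.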
Abstract
Fine-Grained Visual Classification (FGVC) remains a challenging task due to subtle inter-class differences and large intra-class variations. Existing approaches typically rely on feature-selection mechanisms or region-proposal strategies to localize discriminative regions for semantic analysis. However, these methods often fail to capture discriminative cues comprehensively while introducing substantial category-agnostic redundancy. To address these limitations, we propose H3Former, a novel token-to-region framework that leverages high-order semantic relations to aggregate local fine-grained representations with structured region-level modeling. Specifically, we propose the Semantic-Aware Aggregation Module (SAAM), which exploits multi-scale contextual cues to dynamically construct a weighted hypergraph among tokens. By applying hypergraph convolution, SAAM captures high-order semantic dependencies and progressively aggregates token features into compact region-level representations. Furthermore, we introduce the Hyperbolic Hierarchical Contrastive Loss (HHCL), which enforces hierarchical semantic constraints in a non-Euclidean embedding space. The HHCL enhances inter-class separability and intra-class consistency while preserving the intrinsic hierarchical relationships among fine-grained categories. Comprehensive experiments conducted on four standard FGVC benchmarks validate the superiority of our H3Former framework.
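The token-to-region aggregation described in the abstract can be illustrated with a small sketch. It assumes that each hyperedge corresponds to one semantic prototype and that the incidence weights come from a softmax over token-prototype dot products; the abstract does not give these details, so the construction, the function names, and the absence of learnable projection weights are all illustrative assumptions rather than the SAAM design itself.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def soft_incidence(tokens, prototypes):
    """Weighted incidence matrix H: row i holds token i's membership
    in each hyperedge (assumed here to be one hyperedge per prototype)."""
    return [softmax([dot(t, p) for p in prototypes]) for t in tokens]

def aggregate_regions(tokens, incidence):
    """Token-to-hyperedge step: each region feature is the
    incidence-weighted mean of the token features."""
    dim = len(tokens[0])
    regions = []
    for e in range(len(incidence[0])):
        degree = sum(row[e] for row in incidence)  # hyperedge degree
        regions.append([
            sum(row[e] * t[d] for row, t in zip(incidence, tokens)) / degree
            for d in range(dim)
        ])
    return regions

def propagate_to_tokens(regions, incidence):
    """Hyperedge-to-token step: each token receives a mix of region
    features weighted by its memberships (rows of H already sum to 1)."""
    dim = len(regions[0])
    return [
        [sum(w * r[d] for w, r in zip(row, regions)) for d in range(dim)]
        for row in incidence
    ]

# Hypothetical toy input: four 2-D tokens and two prototypes.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
prototypes = [[5.0, 0.0], [0.0, 5.0]]
H = soft_incidence(tokens, prototypes)
regions = aggregate_regions(tokens, H)
updated = propagate_to_tokens(regions, H)
print(regions)
```

In this toy setup the two region vectors end up dominated by the tokens closest to each prototype, which is the intuition behind aggregating tokens into compact region-level representations. A trained model would additionally learn the prototypes and apply feature transforms around each step.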
