HELM: Hierarchical and Explicit Label Modeling with Graph Learning for Multi-Label Image Classification

Marjan Stoimchev, Boshko Koloski, Jurica Levatić, Dragi Kocev, Sašo Džeroski

Abstract

Hierarchical multi-label classification (HMLC) is essential for modeling complex label dependencies in remote sensing. Existing methods, however, struggle with multi-path hierarchies where instances belong to multiple branches, and they rarely exploit unlabeled data. We introduce HELM (Hierarchical and Explicit Label Modeling), a novel framework that overcomes these limitations. HELM: (i) uses hierarchy-specific class tokens within a Vision Transformer to capture nuanced label interactions; (ii) employs graph convolutional networks to explicitly encode the hierarchical structure and generate hierarchy-aware embeddings; and (iii) integrates a self-supervised branch to effectively leverage unlabeled imagery. We perform a comprehensive evaluation on four remote sensing image (RSI) datasets (UCM, AID, DFC-15, MLRSNet). HELM achieves state-of-the-art performance, consistently outperforming strong baselines in both supervised and semi-supervised settings, demonstrating particular strength in low-label scenarios.
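
To make component (ii) concrete, the following is a minimal sketch of a single graph-convolution step over a label hierarchy. It is not the authors' implementation: the abstract does not specify the GCN formulation, so the sketch assumes the widely used symmetric normalization with self-loops, and the toy hierarchy, the one-hot input features, and the weight values are all hypothetical.

```python
import math

# Hypothetical toy hierarchy (child -> parent); not the label set used in the paper.
PARENT = {
    "forest": "vegetation",
    "grass": "vegetation",
    "river": "water",
    "vegetation": "root",
    "water": "root",
}


def build_adjacency(parent):
    """Return (nodes, adj): an undirected adjacency matrix with self-loops."""
    nodes = sorted(set(parent) | set(parent.values()))
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for child, par in parent.items():
        i, j = index[child], index[par]
        adj[i][j] = adj[j][i] = 1.0
    return nodes, adj


def normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2} (assumed, not from the paper)."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(adj_norm, features, weight):
    """One propagation step: ReLU(A_norm @ X @ W), written with plain lists."""
    n, d_in, d_out = len(features), len(weight), len(weight[0])
    # Aggregate each node's features with those of its parent and children.
    agg = [[sum(adj_norm[i][k] * features[k][c] for k in range(n))
            for c in range(d_in)] for i in range(n)]
    # Apply the linear map followed by ReLU.
    return [[max(0.0, sum(agg[i][c] * weight[c][o] for c in range(d_in)))
             for o in range(d_out)] for i in range(n)]


nodes, adj = build_adjacency(PARENT)
adj_norm = normalize(adj)
# One-hot node features and a fixed illustrative weight matrix (both hypothetical).
features = [[1.0 if i == j else 0.0 for j in range(len(nodes))]
            for i in range(len(nodes))]
weight = [[0.1 * (r + c + 1) for c in range(2)] for r in range(len(nodes))]
embeddings = gcn_layer(adj_norm, features, weight)
```

After this step, each label's embedding mixes information from its parent and children, which is the sense in which such embeddings can be called hierarchy-aware. In a trained model the weight matrix would be learned rather than fixed.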

Paper Structure (26 sections, 6 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: High-level overview of HELM. The framework integrates a ViT encoder with hierarchy-specific tokens that feed three distinct branches: (a) classification, (b) graph learning with a GCN, and (c) a BYOL self-supervised branch. The losses from each branch are combined to optimize the model end-to-end.
  • Figure 2: Semi-supervised results with different labeled proportions. HELM consistently outperforms the supervised baseline, with the largest gains at 1–5% labeled data. Full results are available in the appendix.
  • Figure 3: Example of constructed label hierarchy for the UCM dataset, derived from the CORINE Land Cover nomenclature. The hierarchy demonstrates the 3-level structure with 4 top-level categories, 9 intermediate-level categories, and 17 leaf-level labels.
  • Figure 4: Comparison of training times, performance, and parameter counts for the baseline methods and HELM on the UCM dataset. The size of each bubble corresponds to the number of parameters (in millions).
  • Figure 5: Comparison of 2-D UMAP embeddings between HELM and state-of-the-art methods for the UCM dataset. The learned embeddings are colored based on different levels of the UCM label hierarchy. The visualization is based on embeddings corresponding to leaf labels, while the color coding reflects the grouping and relationships at each hierarchical level. The NMI values are reported for each method, where higher values indicate better alignment between clusters and ground truth labels, reflecting the quality of hierarchical embeddings.
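
The NMI score mentioned in the Figure 5 caption can be illustrated with a short sketch. The caption does not say which normalization the authors use, so the version below assumes the arithmetic-mean variant, NMI = 2 I(U;V) / (H(U) + H(V)); the function names are illustrative only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of equal length."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """Normalized mutual information with arithmetic-mean normalization (assumed)."""
    if len(a) != len(b):
        raise ValueError("label sequences must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        # Both assignments are a single cluster: treat as perfect agreement.
        return 1.0
    return 2.0 * mutual_information(a, b) / (h_a + h_b)


# Cluster ids are arbitrary, so a relabeled but identical grouping scores close to 1,
# while a grouping that is independent of the reference scores close to 0.
same_grouping = nmi([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the score is invariant to how clusters are numbered, it can compare cluster assignments of the embeddings against ground-truth labels at any level of the hierarchy.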