Visually Consistent Hierarchical Image Classification

Seulki Park; Youren Zhang; Stella X. Yu; Sara Beery; Jonathan Huang

TL;DR

This work tackles incorrect hierarchical predictions by enforcing internal visual consistency across taxonomy levels through intra-image segmentation. The proposed H-CAST framework adds a visual-consistency mechanism and a Tree-path KL Divergence semantic loss to align coarse and fine predictions via progressively shared segmentations, yielding higher Full-Path Accuracy and better segmentation without pixel-level labels. Empirical results on BREEDS, CUB, FGVC-Aircraft, and iNaturalist demonstrate substantial gains over state-of-the-art hierarchical and flat models, including vision foundation models, with ablations validating architectural choices and loss functions. The approach not only improves hierarchical recognition but also enhances segmentation quality, suggesting practical benefits for robust, multi-level visual understanding in real-world scenarios.
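The summary names a Tree-path KL Divergence loss but does not give its formula. The sketch below shows one plausible reading, stated as an assumption rather than the authors' definition: logits from all levels are concatenated, a softmax is taken over the concatenation, and the target places equal mass on each ground-truth label along the tree path. The function name `tree_path_kl` and its argument layout are hypothetical.

```python
import math

def tree_path_kl(logits_per_level, target_per_level):
    """KL(target || softmax(concatenated logits)) for a single sample.

    logits_per_level: list of lists, one list of raw scores per hierarchy level.
    target_per_level: list of ints, the ground-truth class index at each level.
    ASSUMPTION: the target puts mass 1/L on each of the L labels on the path.
    """
    # Concatenate the logits of all levels into one flat vector.
    flat = [z for level in logits_per_level for z in level]

    # Map each per-level target index to its position in the flat vector.
    target_positions, offset = [], 0
    for level, t in zip(logits_per_level, target_per_level):
        target_positions.append(offset + t)
        offset += len(level)

    # Numerically stable log-partition of the softmax over the flat vector.
    m = max(flat)
    log_z = m + math.log(sum(math.exp(z - m) for z in flat))

    # Only target positions have non-zero mass, so the KL sum runs over them.
    q = 1.0 / len(target_positions)
    return sum(q * (math.log(q) - (flat[i] - log_z)) for i in target_positions)

# Two levels: 2 coarse classes, 3 fine classes; the true path is (0, 0).
uniform = tree_path_kl([[0.0, 0.0], [0.0, 0.0, 0.0]], [0, 0])
confident = tree_path_kl([[5.0, 0.0], [5.0, 0.0, 0.0]], [0, 0])
```

Under this reading, a single loss term couples all levels: raising the score of a label off the true path at any level takes probability mass away from every label on the path.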

Abstract

Hierarchical classification predicts labels across multiple levels of a taxonomy, e.g., from coarse-level 'Bird' to mid-level 'Hummingbird' to fine-level 'Green hermit', allowing flexible recognition under varying visual conditions. It is commonly framed as multiple single-level tasks, but each level may rely on different visual cues: distinguishing 'Bird' from 'Plant' relies on global features like feathers or leaves, while separating 'Anna's hummingbird' from 'Green hermit' requires local details such as head coloration. Prior methods improve accuracy using external semantic supervision, but such statistical learning criteria do not ensure consistent visual grounding at test time, which leads to incorrect hierarchical classification. We propose, for the first time, to enforce internal visual consistency by aligning fine-to-coarse predictions through intra-image segmentation. Our method outperforms zero-shot CLIP and state-of-the-art baselines on hierarchical classification benchmarks, achieving both higher accuracy and more consistent predictions. It also improves internal image segmentation without requiring pixel-level annotations.
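To make "higher accuracy and more consistent predictions" concrete, here is a minimal sketch of two evaluation quantities for hierarchical predictions. It assumes Full-Path Accuracy means the fraction of samples whose labels are correct at every level, and that a prediction is consistent when each finer label's parent in the taxonomy equals the predicted coarser label. The function names, the `parent_of` mapping, and the toy labels are illustrative, not taken from the paper.

```python
def full_path_accuracy(preds, labels):
    """Fraction of samples whose predicted path matches the true path at every level.

    preds, labels: lists of tuples ordered coarse -> fine, one tuple per sample.
    """
    if not preds or len(preds) != len(labels):
        raise ValueError("preds and labels must be non-empty and of equal length")
    hits = sum(1 for p, y in zip(preds, labels) if tuple(p) == tuple(y))
    return hits / len(preds)

def path_consistency(preds, parent_of):
    """Fraction of predicted paths that are valid paths in the taxonomy.

    parent_of: dict mapping each non-root label to its parent label.
    A path can be consistent and still wrong (e.g. 'bird' -> 'finch' for a hummingbird).
    """
    if not preds:
        raise ValueError("preds must be non-empty")
    ok = 0
    for path in preds:
        # Check every adjacent (coarser, finer) pair along the predicted path.
        if all(parent_of.get(fine) == coarse for coarse, fine in zip(path, path[1:])):
            ok += 1
    return ok / len(preds)

# Toy two-level taxonomy (hypothetical labels).
parent_of = {"hummingbird": "bird", "finch": "bird", "fern": "plant"}
labels = [("bird", "hummingbird")] * 3
preds = [
    ("bird", "hummingbird"),   # correct and consistent
    ("plant", "hummingbird"),  # fine label right, coarse label wrong: inconsistent
    ("bird", "finch"),         # consistent but wrong at the fine level
]
fpa = full_path_accuracy(preds, labels)
consistency = path_consistency(preds, parent_of)
```

The two quantities are reported separately because they can diverge: the third toy prediction is a valid taxonomy path yet still counts against full-path accuracy.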

Paper Structure (26 sections, 6 equations, 11 figures, 12 tables)

Introduction
Related Work
Consistent Hierarchical Classification
H-CAST for Visual Consistency
Tree-path KL Divergence loss for Semantic Consistency
Experiments
Experimental Settings
Hierarchical Classification with Vision Foundation Models
Consistent Hierarchical Classification on Benchmarks
Visualizations of Structured Visual Parsing
Effect of Visual Grounding on Hierarchical Classification
Ablation Analysis of Architecture Design and Loss Function in H-CAST
Additional Benefits of Hierarchical Classification for Segmentation
Summary
Quantitative Evidence for Consistent Visual Grounding
...and 11 more sections

Figures (11)

Figure 1: We propose enforcing internal visual consistency to improve hierarchical classification across taxonomy levels. Prior works rely on external semantic supervision, a statistical criterion that fails to ensure consistent visual focus at test time. Our approach is the first to align predictions through intra-image consistency, improving both accuracy and coherence. Our code is available at https://github.com/pseulki/hcast.
Figure 2: Incorrect hierarchical classification often results from inconsistent visual grounding across hierarchy levels. We show Grad-CAM visualizations [selvaraju2017grad] of FGN [chang2021your] trained on BREEDS (Entity-30) [data2021breeds]. a) A consistent case: both classifiers focus on the same object, with the fine-grained classifier capturing details (e.g., the bird's leg) and the coarse classifier attending to the whole bird. b) The coarse classifier localizes the chimpanzee, but the fine classifier fails to attend to its crucial details and makes a wrong prediction. c) The fine-grained classifier correctly identifies the feather boa, while the coarse classifier wrongly attends to the bicycle. d) Both classifiers attend to misaligned areas and make wrong predictions. These cases show that semantic accuracy relies on consistent visual grounding. Our model aligns visual attention across levels, captures different details within a coherent region, and predicts all four cases correctly.
Figure 3: Our method ensures internal visual consistency by aligning coarse and fine classifiers on hierarchical segmentation, unlike prior approaches that rely only on external semantic losses without visual grounding. Segmentation outputs show how fine details (e.g., wings, head, tail) at the 32-way level are grouped into a unified bird region at the 8-way level. Identical color hues indicate consistent groupings, encouraging the model to attend to coherent image regions.
Figure 4: Our method implements visually grounded hierarchical classification through visual and semantic modules. The Visual Consistency module uses fine-to-coarse superpixel groupings to ensure classifiers at different levels focus on corresponding regions while capturing different details. The Semantic Consistency module encodes label hierarchies to align predictions across levels. Together, they encourage cooperative learning across the hierarchy and improve overall performance.
Figure 5: Vision foundation models struggle with consistent predictions in hierarchical classification. We evaluate CLIP [clip_2021] on the 2-level BREEDS dataset (top) and present misclassification examples from Entity-13 (bottom). a) CLIP struggles to maintain consistency and correctness, achieving only about 50% accuracy on Entity-13. b) CLIP more frequently predicts the coarse category correctly while misclassifying the fine-grained category compared to H-CAST across all datasets. c) CLIP often predicts the fine-grained category correctly but fails at the coarse level, a mistake that is rare in H-CAST, suggesting difficulty in grasping broader conceptual understanding. H-CAST accurately predicts cases a-c. d) Both CLIP and H-CAST fail in complex scenes.
...and 6 more figures
