Visually Consistent Hierarchical Image Classification
Seulki Park, Youren Zhang, Stella X. Yu, Sara Beery, Jonathan Huang
TL;DR
This work addresses incorrect hierarchical predictions by enforcing internal visual consistency across taxonomy levels through intra-image segmentation. The proposed H-CAST framework combines a visual-consistency mechanism, which aligns coarse and fine predictions via progressively shared segmentations, with a Tree-path KL Divergence semantic loss. Experiments on BREEDS, CUB, FGVC-Aircraft, and iNaturalist show higher Full-Path Accuracy, with substantial gains over state-of-the-art hierarchical and flat models, including vision foundation models; ablations support the architectural and loss-function choices. The method also improves segmentation quality without pixel-level labels, suggesting practical benefits for robust, multi-level visual understanding in real-world scenarios.
Abstract
Hierarchical classification predicts labels across multiple levels of a taxonomy, e.g., from coarse-level 'Bird' to mid-level 'Hummingbird' to fine-level 'Green hermit', allowing flexible recognition under varying visual conditions. It is commonly framed as multiple single-level tasks, but each level may rely on different visual cues: Distinguishing 'Bird' from 'Plant' relies on global features like feathers or leaves, while separating 'Anna's hummingbird' from 'Green hermit' requires local details such as head coloration. Prior methods improve accuracy using external semantic supervision, but such statistical learning criteria fail to ensure consistent visual grounding at test time, resulting in incorrect hierarchical classification. We propose, for the first time, to enforce internal visual consistency by aligning fine-to-coarse predictions through intra-image segmentation. Our method outperforms zero-shot CLIP and state-of-the-art baselines on hierarchical classification benchmarks, achieving both higher accuracy and more consistent predictions. It also improves internal image segmentation without requiring pixel-level annotations.
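To make "higher accuracy" and "more consistent predictions" concrete, the following is a minimal Python sketch of two evaluation quantities. It rests on assumptions the text above does not spell out. Full-path accuracy is taken to be the fraction of images whose predicted labels are correct at every level. Consistency is taken to mean that each predicted label is the taxonomy parent of the label predicted one level below. The toy taxonomy reuses the abstract's bird labels and adds hypothetical 'Tree' and 'Oak' entries. The function names are illustrative and are not the authors' evaluation code.

```python
from typing import Dict, Sequence, Tuple

Path = Tuple[str, ...]  # labels ordered coarse -> fine

# Hypothetical toy taxonomy (child -> parent). 'Tree' and 'Oak' are
# made-up entries added only to give 'Plant' a full path.
PARENT: Dict[str, str] = {
    "Hummingbird": "Bird",
    "Green hermit": "Hummingbird",
    "Anna's hummingbird": "Hummingbird",
    "Tree": "Plant",
    "Oak": "Tree",
}


def is_consistent(path: Path) -> bool:
    """True if every label is the taxonomy parent of the next finer label."""
    return all(PARENT.get(child) == parent
               for parent, child in zip(path, path[1:]))


def full_path_accuracy(preds: Sequence[Path], targets: Sequence[Path]) -> float:
    """Fraction of samples whose prediction is correct at all levels (assumed definition)."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


def consistency_rate(preds: Sequence[Path]) -> float:
    """Fraction of predicted paths that are valid root-to-leaf paths in the taxonomy."""
    return sum(is_consistent(p) for p in preds) / len(preds)


targets = [
    ("Bird", "Hummingbird", "Green hermit"),
    ("Bird", "Hummingbird", "Anna's hummingbird"),
    ("Plant", "Tree", "Oak"),
    ("Bird", "Hummingbird", "Green hermit"),
]
preds = [
    ("Bird", "Hummingbird", "Green hermit"),   # correct at all levels
    ("Bird", "Hummingbird", "Green hermit"),   # consistent, but wrong fine label
    ("Plant", "Hummingbird", "Green hermit"),  # inconsistent: coarse contradicts mid
    ("Bird", "Hummingbird", "Green hermit"),   # correct at all levels
]

print(full_path_accuracy(preds, targets))  # 0.5
print(consistency_rate(preds))             # 0.75
```

The third prediction shows the failure mode the abstract describes: the coarse label is right on its own, yet the path as a whole is not a valid branch of the taxonomy, so it counts against both measures.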
