Learning Visual Hierarchies in Hyperbolic Space for Image Retrieval
Ziwei Wang, Sameera Ramasinghe, Chenchen Xu, Julien Monteil, Loris Bazzani, Thalaiyasingam Ajanthan
TL;DR
This work tackles the challenge of learning visual hierarchies by embedding multi-level relationships in hyperbolic space $\mathbb{H}^d$ without explicit hierarchical labels. It introduces a hyperbolic angle-based entailment loss $L_{angle}$ that enforces asymmetric parent–child entailment across within- and cross-image pairs, leveraging the Lorentz model and tangent-space maps. The authors additionally propose an optimal transport-based metric (1-D Wasserstein distance) to evaluate hierarchical retrieval and demonstrate strong improvements on a newly constructed HierOpenImages dataset, with notable generalization to out-of-domain LVIS, VOC, and COCO data. Overall, the approach yields semantic-structural representations that surpass standard visual-similarity objectives for hierarchical image retrieval and scene understanding.
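As a rough illustration of the Lorentz model and the tangent-space map mentioned above, the following NumPy sketch lifts Euclidean vectors onto the hyperboloid and measures geodesic distance. It assumes a fixed curvature of -1 and an exponential map taken at the hyperboloid origin; the function names are hypothetical, and the entailment loss $L_{angle}$ itself is not reproduced here because its exact form is not given in this summary.

```python
import numpy as np

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L = -x_0*y_0 + sum_i x_i*y_i."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def exp_map_origin(u, eps=1e-9):
    """Map a tangent vector u (space components only) at the origin
    onto the hyperboloid, assuming curvature -1."""
    norm = np.linalg.norm(u, axis=-1, keepdims=True)
    safe = np.maximum(norm, eps)           # avoid division by zero at u = 0
    time = np.cosh(norm)                   # time-like coordinate x_0
    space = np.sinh(norm) * u / safe       # space-like coordinates x_1..x_d
    return np.concatenate([time, space], axis=-1)

def lorentz_distance(x, y):
    """Geodesic distance d(x, y) = arccosh(-<x, y>_L), assuming curvature -1."""
    return np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None))

# Usage: embed a Euclidean encoder output and measure its distance to the origin.
u = np.array([0.3, -0.4])                  # hypothetical encoder output, norm 0.5
x = exp_map_origin(u)
origin = exp_map_origin(np.zeros(2))
dist_to_origin = lorentz_distance(origin, x)
```

In this setting the distance from the origin equals the Euclidean norm of the tangent vector, which is one reason the origin is a convenient base point for the map.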
Abstract
Structuring latent representations in a hierarchical manner enables models to learn patterns at multiple levels of abstraction. However, most prevalent image understanding models focus on visual similarity, and learning visual hierarchies remains relatively unexplored. In this work, for the first time, we introduce a learning paradigm that can encode user-defined, complex multi-level visual hierarchies in hyperbolic space without requiring explicit hierarchical labels. As a concrete example, we first define a part-based image hierarchy using object-level annotations within and across images. We then introduce an approach that enforces the hierarchy using a contrastive loss with pairwise entailment metrics. Finally, we discuss new evaluation metrics that measure hierarchical image retrieval effectively. Encoding these complex relationships ensures that the learned representations capture semantic and structural information that goes beyond visual similarity. Experiments in part-based image retrieval show significant improvements on hierarchical retrieval tasks, demonstrating that our model captures visual hierarchies.
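The TL;DR above names a 1-D Wasserstein distance as the optimal transport-based evaluation metric. A minimal sketch of that distance for two equal-size empirical samples follows; how the paper builds the two distributions from retrieval results is not stated here, so the inputs below are placeholder values and the function name is hypothetical.

```python
import numpy as np

def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance between two equal-size empirical samples.

    In one dimension the optimal transport plan matches sorted values,
    so the distance is the mean absolute difference of the sorted samples.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape:
        raise ValueError("this sketch assumes equal-size samples")
    return float(np.mean(np.abs(a - b)))

# Usage with placeholder values: every unit of mass is shifted by one.
shifted = wasserstein_1d([0, 1, 2], [1, 2, 3])
# Order does not matter, only the distribution of values.
same = wasserstein_1d([3, 1, 2], [1, 2, 3])
```

For samples of different sizes or with weights, `scipy.stats.wasserstein_distance` computes the same quantity in the general case.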
