Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
Hyeongjun Kwon, Jinhyun Jang, Jin Kim, Kwonyoung Kim, Kwanghoon Sohn
TL;DR
Hi-Mapper addresses the lack of structured visual hierarchy in pre-trained DNNs by constructing a probabilistic hierarchy tree in which leaf nodes are Gaussian distributions and higher-level nodes are mixtures of Gaussians, with a KL-divergence regularizer to prevent collapse. The hierarchy is learned in hyperbolic space through a Lorentz-model embedding and a novel hierarchical contrastive loss defined on Lorentz distances. A hierarchy decomposition step breaks down the features of a pre-trained encoder according to the tree, and a hierarchy encoding module then updates the global representation. The approach is plug-and-play and improves image classification, object detection, and semantic segmentation across diverse backbones (ResNet, EfficientNet, DeiT, Swin) on standard benchmarks, with only modest parameter overhead. These results indicate that probabilistic, hyperbolic hierarchies yield better-structured scene representations and improve performance on both image classification and dense prediction tasks.
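To make the probabilistic tree concrete, the following is a minimal sketch of a parent node formed as a mixture of its Gaussian children, together with a KL term of the kind used as a regularizer. It is not the authors' implementation: the equal mixture weights, the diagonal covariances, the standard normal prior, and the helper names `mixture_moments` and `kl_to_standard_normal` are all assumptions made for illustration.

```python
import math


def mixture_moments(children):
    """Mean and per-dimension variance of an equal-weight mixture.

    `children` is a list of (mean, var) pairs describing diagonal Gaussians.
    Equal weights are an assumption of this sketch.
    """
    n = len(children)
    dim = len(children[0][0])
    mean = [sum(c[0][d] for c in children) / n for d in range(dim)]
    # Law of total variance: E[var_i + mean_i^2] - (mixture mean)^2
    var = [
        sum(c[1][d] + c[0][d] ** 2 for c in children) / n - mean[d] ** 2
        for d in range(dim)
    ]
    return mean, var


def kl_to_standard_normal(mean, var):
    """KL( N(mean, diag(var)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mean, var))


# Two hypothetical leaf nodes that differ along the first dimension.
leaf_a = ([-1.0, 0.0], [0.25, 0.25])
leaf_b = ([1.0, 0.0], [0.25, 0.25])

parent_mean, parent_var = mixture_moments([leaf_a, leaf_b])
leaf_penalty = kl_to_standard_normal(*leaf_a)
```

In this sketch the parent is at least as broad as each child in every dimension, which is one simple way to read a coarse node as covering its finer children. The KL term grows as a leaf variance shrinks toward zero, so penalizing it discourages the leaves from collapsing to points.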
Abstract
Visual scenes are naturally organized in a hierarchy, where a coarse semantic is recursively comprised of several fine details. Exploring such a visual hierarchy is crucial to recognize the complex relations of visual elements, leading to a comprehensive scene understanding. In this paper, we propose a Visual Hierarchy Mapper (Hi-Mapper), a novel approach for enhancing the structured understanding of the pre-trained Deep Neural Networks (DNNs). Hi-Mapper investigates the hierarchical organization of the visual scene by 1) pre-defining a hierarchy tree through the encapsulation of probability densities; and 2) learning the hierarchical relations in hyperbolic space with a novel hierarchical contrastive loss. The pre-defined hierarchy tree recursively interacts with the visual features of the pre-trained DNNs through hierarchy decomposition and encoding procedures, thereby effectively identifying the visual hierarchy and enhancing the recognition of an entire scene. Extensive experiments demonstrate that Hi-Mapper significantly enhances the representation capability of DNNs, leading to an improved performance on various tasks, including image classification and dense prediction tasks.
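As a rough illustration of what measuring relations in hyperbolic space involves, the sketch below uses the standard Lorentz (hyperboloid) model: it lifts a Euclidean feature vector onto the hyperboloid with the exponential map at the origin and computes the geodesic distance between two points. These are textbook formulas rather than code from the paper. The curvature parameter `c`, the (time, space) tuple layout, and the function names are assumptions of this sketch, and the hierarchical contrastive loss itself is not reproduced here.

```python
import math


def exp_map_origin(v, c=1.0):
    """Lift a Euclidean (tangent) vector v onto the hyperboloid of curvature -c.

    Returns a (time, space) pair satisfying <x, x>_L = -1/c.
    """
    sqrt_c = math.sqrt(c)
    r = math.sqrt(sum(x * x for x in v))
    if r == 0.0:
        return 1.0 / sqrt_c, [0.0] * len(v)
    scale = math.sinh(sqrt_c * r) / (sqrt_c * r)
    space = [scale * x for x in v]
    time = math.sqrt(1.0 / c + sum(x * x for x in space))
    return time, space


def lorentz_inner(p, q):
    """Lorentzian inner product: minus the time product plus the space dot product."""
    (pt, ps), (qt, qs) = p, q
    return -pt * qt + sum(a * b for a, b in zip(ps, qs))


def lorentz_distance(p, q, c=1.0):
    """Geodesic distance between two points on the hyperboloid."""
    # Clamp to 1.0 to guard against rounding just below acosh's domain.
    arg = max(1.0, -c * lorentz_inner(p, q))
    return math.acosh(arg) / math.sqrt(c)


# Hypothetical 2-D features standing in for a coarse node and a fine node.
coarse = exp_map_origin([0.1, 0.0])
fine = exp_map_origin([1.5, 0.2])
d = lorentz_distance(coarse, fine)
```

Distances in this model grow rapidly away from the origin, which is why hyperbolic embeddings are commonly used for tree-like data: coarse nodes can sit near the origin while many fine nodes spread out toward the boundary. A contrastive loss built on such distances would pull related nodes together and push unrelated ones apart.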
