Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping

Hyeongjun Kwon; Jinhyun Jang; Jin Kim; Kwonyoung Kim; Kwanghoon Sohn

TL;DR

Hi-Mapper addresses the lack of structured visual hierarchy in pre-trained DNNs by constructing a probabilistic hierarchy tree in which leaf nodes are Gaussians and higher-level nodes are mixtures of Gaussians, regularized by a KL divergence loss to prevent collapse. The hierarchy is learned in hyperbolic space through a Lorentz-model embedding and a novel hierarchical contrastive loss defined on Lorentz distances. A hierarchy decomposition module groups features from the pre-trained encoder into the nodes of the tree, and a hierarchy encoding module folds the resulting visual hierarchy back into the global representation. The approach is plug-and-play and improves image classification, object detection, and semantic segmentation across diverse backbones (ResNet, EfficientNet, DeiT, Swin) on standard benchmarks, with only modest parameter overhead. These results indicate that probabilistic, hyperbolic hierarchies yield better-structured scene representations and improve both image classification and dense prediction.
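To make the probabilistic tree more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes diagonal-covariance leaf Gaussians, equal-weight mixtures over a fixed binary parent-child layout, and a KL term measured against a standard normal prior; the summary above does not specify any of these choices, and the function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row.

    Assumption: the prior is a unit Gaussian; the summary only says "KL".
    """
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def sample_mixture(mu, log_var, rng, n_samples=1):
    """Draw samples from an equal-weight mixture of diagonal Gaussians."""
    k, d = mu.shape
    idx = rng.integers(0, k, size=n_samples)      # pick a component per sample
    eps = rng.standard_normal((n_samples, d))     # reparameterization noise
    return mu[idx] + np.exp(0.5 * log_var[idx]) * eps

rng = np.random.default_rng(0)
leaf_mu = rng.standard_normal((4, 8))   # 4 leaf nodes, 8-dim (illustrative sizes)
leaf_log_var = np.zeros((4, 8))         # unit variance to start

# Hypothetical binary layout: parent 0 owns leaves 0-1, parent 1 owns leaves 2-3.
parents = [
    sample_mixture(leaf_mu[i:i + 2], leaf_log_var[i:i + 2], rng, n_samples=16)
    for i in (0, 2)
]
kl = kl_to_standard_normal(leaf_mu, leaf_log_var)
```

Penalizing this KL term keeps the leaf variances from shrinking toward zero, which is one common reading of "prevent collapse" for Gaussian embeddings.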

Abstract

Visual scenes are naturally organized in a hierarchy, where a coarse semantic is recursively composed of several fine details. Exploring such a visual hierarchy is crucial for recognizing the complex relations among visual elements, leading to comprehensive scene understanding. In this paper, we propose a Visual Hierarchy Mapper (Hi-Mapper), a novel approach for enhancing the structured understanding of pre-trained Deep Neural Networks (DNNs). Hi-Mapper investigates the hierarchical organization of the visual scene by 1) pre-defining a hierarchy tree through the encapsulation of probability densities; and 2) learning the hierarchical relations in hyperbolic space with a novel hierarchical contrastive loss. The pre-defined hierarchy tree recursively interacts with the visual features of the pre-trained DNNs through hierarchy decomposition and encoding procedures, thereby effectively identifying the visual hierarchy and enhancing the recognition of the entire scene. Extensive experiments demonstrate that Hi-Mapper significantly enhances the representation capability of DNNs, leading to improved performance on various tasks, including image classification and dense prediction.
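For readers unfamiliar with the hyperbolic machinery, the sketch below shows the standard Lorentz (hyperboloid) model operations that such a mapping typically relies on: the exponential map at the origin and the geodesic distance. These are textbook formulas for curvature $-c$; how Hi-Mapper parameterizes or learns the curvature is not stated in this summary, so `c=1.0` and the function names are assumptions.

```python
import numpy as np

def exp_map_origin(v, c=1.0):
    """Map Euclidean vectors v (..., d) onto the hyperboloid of curvature -c.

    Returns points (..., d+1) whose first coordinate is the time component.
    """
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    scale = np.sinh(sqrt_c * norm) / np.maximum(sqrt_c * norm, 1e-8)
    space = scale * v
    # The time component follows from the constraint <x, x>_L = -1/c.
    time = np.sqrt(1.0 / c + np.sum(space**2, axis=-1, keepdims=True))
    return np.concatenate([time, space], axis=-1)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + <x_space, y_space>."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def lorentz_distance(x, y, c=1.0):
    """Geodesic distance on the hyperboloid; the clip guards arccosh's domain."""
    arg = np.clip(-c * lorentz_inner(x, y), 1.0, None)
    return np.arccosh(arg) / np.sqrt(c)

v = np.array([[0.3, 0.4]])                    # Euclidean norm 0.5
x = exp_map_origin(v)
origin = exp_map_origin(np.zeros((1, 2)))
d = lorentz_distance(origin, x)               # equals the Euclidean norm of v
```

The distance from the origin to `exp_map_origin(v)` equals the Euclidean norm of `v`, which is a convenient sanity check when wiring these functions into a larger model.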

Paper Structure (42 sections, 21 equations, 8 figures, 6 tables)

Introduction
Related Work
Hierarchy-aware visual recognition.
Probabilistic modeling.
Hyperbolic manifold.
Hyperbolic Geometry
Method
Overview
Probabilistic hierarchy tree
Initial level.
Subsequent level.
KL divergence loss.
Visual hierarchy decomposition
Learning hierarchy in hyperbolic space
Hierarchical contrastive loss.
...and 27 more sections

Figures (8)

Figure 1: (a) A visual scene can be decomposed into a hierarchical structure based on the semantics of each visual element. (b) Euclidean space is suboptimal for representing the hierarchical structure because of its flat geometry: relational distances are captured inaccurately, without regard to the semantic similarity of visual elements (red line). Hi-Mapper maps the hierarchical elements into hyperbolic space, which effectively preserves their semantic relations and distances due to its constant negative curvature.
Figure 2: (a) An overview of the proposed Hi-Mapper. Hi-Mapper operates on top of a pre-trained image encoder $\mathcal{F}$, with a probabilistic hierarchy tree $\mathbf{T}=\{\mathbf{C}^{l}\}^{L}_{l=1}$. The tree interacts with the visual feature map $\textbf{v}_\text{map}$ through the hierarchy decomposition module $\mathcal{D}$, thereby identifying the visual hierarchy in Euclidean space $\mathbf{T}_\mathbb{E}=\{\mathbf{S}^{l}\}^{L}_{l=1}$. The visual hierarchy is mapped to hyperbolic space $\mathbf{T}_{\mathbb{L}} = \{\mathbf{H}^{l}\}_{l=1}^{L}$ and optimized with the hierarchical contrastive loss $\mathcal{L}_{\mathbb{L}\text{-cont}}$. The visual hierarchy is further encoded into the global visual representation $\textbf{v}_\text{cls}$ via the hierarchy encoding module $\mathcal{G}$ to enhance recognition of the entire scene. (b) The proposed hierarchical contrastive loss pulls each parent-child pair together and pushes away all other nodes at the same level.
Figure 3: (a) Hierarchy decomposition module groups semantically-relevant visual features $\mathbf{v}_{\text{map}}$ to the closest semantic cluster $\mathbf{C}^{l}$. (b) Hierarchy encoding module progressively updates global representation $\mathbf{v}_{\text{cls}}$ by aggregating the visual hierarchy $\mathbf{S}^{l}$.
Figure 4: Visualization of the visual hierarchy decomposed by Hi-Mapper (DeiT-S) trained on ImageNet-1K with a classification objective. Each color represents a different subtree. Nodes covering small regions are omitted, and only the main subtrees are displayed.
Figure 5: Hyper-parameter analysis on ImageNet-1K.
...and 3 more figures
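
The pull-push behavior described in the Figure 2(b) caption can be illustrated with an InfoNCE-style loss over Lorentz distances. This is a hedged sketch rather than the paper's exact formulation: it assumes the positive for each child is its parent, the negatives are the other nodes at the child's level, and a temperature `tau`; the helper `lift` and all function names are hypothetical.

```python
import numpy as np

def lift(v, c=1.0):
    """Place Euclidean vectors on the hyperboloid by solving for the time coordinate."""
    time = np.sqrt(1.0 / c + np.sum(v**2, axis=-1, keepdims=True))
    return np.concatenate([time, v], axis=-1)

def lorentz_distance_matrix(a, b, c=1.0):
    """Pairwise Lorentz distances between rows of a (n, d+1) and b (m, d+1)."""
    inner = -np.outer(a[:, 0], b[:, 0]) + a[:, 1:] @ b[:, 1:].T
    return np.arccosh(np.clip(-c * inner, 1.0, None)) / np.sqrt(c)

def hierarchical_contrastive_loss(children, parents, parent_of, tau=0.1, c=1.0):
    """Pull each child toward its parent, push it away from same-level nodes.

    Assumption: negatives are the other child-level nodes; the paper's exact
    choice of negatives and normalization may differ.
    """
    rows = np.arange(len(children))
    d_pos = lorentz_distance_matrix(children, parents, c)[rows, parent_of]
    neg = np.exp(-lorentz_distance_matrix(children, children, c) / tau)
    np.fill_diagonal(neg, 0.0)                   # a node is not its own negative
    pos = np.exp(-d_pos / tau)
    return float(np.mean(-np.log(pos / (pos + neg.sum(axis=1)))))

children = lift(np.array([[1.0, 0.2], [1.0, -0.2], [-1.0, 0.2], [-1.0, -0.2]]))
parents = lift(np.array([[1.0, 0.0], [-1.0, 0.0]]))
loss_correct = hierarchical_contrastive_loss(children, parents, np.array([0, 0, 1, 1]))
loss_swapped = hierarchical_contrastive_loss(children, parents, np.array([1, 1, 0, 0]))
```

With the toy layout above, assigning each child to its nearby parent gives a lower loss than swapping the assignments, which is the qualitative behavior the caption describes.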
