
A Hyperbolic Perspective on Hierarchical Structure in Object-Centric Scene Representations

Neelu Madan, Àlex Pujol, Andreas Møgelmose, Sergio Escalera, Kamal Nasrollahi, Graham W. Taylor, Thomas B. Moeslund

Abstract

Slot attention has emerged as a powerful framework for unsupervised object-centric learning, decomposing visual scenes into a small set of compact vector representations called \emph{slots}, each capturing a distinct region or object. However, these slots are learned in Euclidean space, which provides no geometric inductive bias for the hierarchical relationships that naturally structure visual scenes. In this work, we propose a simple post-hoc pipeline that projects Euclidean slot embeddings onto the Lorentz hyperboloid of hyperbolic space without modifying the underlying training pipeline. We construct five-level visual hierarchies directly from slot attention masks and analyse whether hyperbolic geometry reveals latent hierarchical structure that remains invisible in Euclidean space. Integrating our pipeline with SPOT (images), VideoSAUR (video), and SlotContrast (video), we find that hyperbolic projection exposes a consistent scene-level to object-level organisation, in which coarse slots occupy greater manifold depth than fine slots; this organisation is absent in Euclidean space. We further identify a "curvature--task tradeoff": low curvature ($c{=}0.2$) matches or outperforms Euclidean on parent slot retrieval, while moderate curvature ($c{=}0.5$) achieves better inter-level separation. Together, these findings suggest that slot representations already encode latent hierarchy that hyperbolic geometry reveals, motivating end-to-end hyperbolic training as a natural next step. Code and models are available at \href{https://github.com/NeeluMadan/HHS}{github.com/NeeluMadan/HHS}.

Paper Structure (16 sections, 7 equations, 8 figures, 2 tables)


Figures (8)

  • Figure 1: Visual hierarchy on the Lorentz hyperboloid. Abstract scene-level concepts reside near the apex; representations grow increasingly fine-grained with depth, with geodesics encoding parent--child relationships. This geometric property motivates our analysis of object-centric slot representations through a hyperbolic lens.
  • Figure 2: Overview of our post-hoc pipeline. Patch features from a frozen DINOv2 backbone are decoded into $N$ slot representations via Slot Attention. The Euclidean path reconstructs slot masks through the baseline decoder, from which ground-truth parent--child pairs $\mathcal{P}$ are derived via mask inclusion. The hyperbolic path projects the same slots onto the Lorentz hyperboloid $\mathbb{H}^d_K$ via the exponential map, where geodesic distances and Lorentz norms are used for hierarchical analysis without any modification to the original training.
  • Figure 3: SlotContrast (YTVIS)
  • Figure 4: VideoSAURv2 (YTVIS)
  • Figure 5: SPOT (COCO)
  • ...and 3 more figures
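
The hyperbolic path described in the Figure 2 caption (exponential map onto the Lorentz hyperboloid, followed by geodesic distances) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the map is taken at the hyperboloid origin, that curvature is $-c$ with $c > 0$, and that "depth" means geodesic distance from the origin. The function names `expmap0`, `lorentz_inner`, and `geodesic_distance`, and the random stand-in slots, are hypothetical.

```python
import numpy as np

def expmap0(v, c):
    """Exponential map at the origin: Euclidean (..., d) -> hyperboloid (..., d+1).

    Assumes curvature -c, so that outputs satisfy <x, x>_L = -1/c.
    """
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    norm = np.maximum(norm, 1e-9)  # guard against division by zero
    time = np.cosh(sqrt_c * norm) / sqrt_c
    space = np.sinh(sqrt_c * norm) * v / (sqrt_c * norm)
    return np.concatenate([time, space], axis=-1)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x_0 y_0 + sum_i x_i y_i."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def geodesic_distance(x, y, c):
    """Geodesic distance between hyperboloid points under curvature -c."""
    arg = np.clip(-c * lorentz_inner(x, y), 1.0, None)  # arccosh needs arg >= 1
    return np.arccosh(arg) / np.sqrt(c)

# Random stand-ins for N = 7 slots of dimension 16 (not real slot embeddings).
rng = np.random.default_rng(0)
slots = rng.normal(size=(7, 16)) * 0.5
c = 0.5  # the "moderate curvature" value quoted in the abstract

points = expmap0(slots, c)
origin = expmap0(np.zeros((1, 16)), c)

# Assumed notion of manifold depth: distance of each slot from the origin.
depth = geodesic_distance(points, origin, c)

# Pairwise geodesic distances between all slots, shape (7, 7).
pairwise = geodesic_distance(points[:, None, :], points[None, :, :], c)
```

Under this plain origin-based map, with no learned scaling, the geodesic distance from the origin equals the Euclidean norm of the input vector. That property serves as a sanity check for the sketch; any additional scaling or normalisation the paper applies before projection is not reflected here.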