HVT: A Comprehensive Vision Framework for Learning in Non-Euclidean Space

Jacob Fein-Ashley, Ethan Feng, Minh Pham

TL;DR

This work introduces the Hyperbolic Vision Transformer (HVT), a Vision Transformer extended to hyperbolic space to better capture hierarchical visual structures. By embedding learnable curvature in positional encodings, using Möbius-based hyperbolic layers, and incorporating a hyperbolic self-attention mechanism, HVT aims to preserve hierarchical relationships throughout the network. The paper provides theoretical foundations, a full set of hyperbolic components, and optimization strategies (e.g., Riemannian Adam, geodesic regularization) and demonstrates improved ImageNet performance over Euclidean ViT at matching parameter counts. Overall, the results suggest that hyperbolic geometry offers a principled and effective path to modeling complex visual hierarchies, with potential impact on scalable, structure-aware vision systems.

Abstract

Data representation in non-Euclidean spaces has proven effective for capturing hierarchical and complex relationships in real-world datasets. Hyperbolic spaces, in particular, provide efficient embeddings for hierarchical structures. This paper introduces the Hyperbolic Vision Transformer (HVT), a novel extension of the Vision Transformer (ViT) that integrates hyperbolic geometry. While traditional ViTs operate in Euclidean space, our method enhances the self-attention mechanism by leveraging hyperbolic distance and Möbius transformations. This enables more effective modeling of hierarchical and relational dependencies in image data. We present rigorous mathematical formulations, showing how hyperbolic geometry can be incorporated into attention layers, feed-forward networks, and optimization. We demonstrate improved image classification performance on the ImageNet dataset.
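To make the two geometric primitives named in the abstract concrete, the following is a minimal sketch of Möbius addition and hyperbolic distance on the Poincaré ball. It uses the standard textbook formulas for a ball of curvature -c and is not taken from the paper's own implementation; the function names `mobius_add` and `poincare_distance`, the pure-Python list representation, and the default c = 1 are all assumptions made for illustration.

```python
import math

def dot(x, y):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def mobius_add(x, y, c=1.0):
    """Möbius addition x (+)_c y on the Poincaré ball of curvature -c.

    Standard formula (assumed here, not quoted from the paper):
    ((1 + 2c<x,y> + c|y|^2) x + (1 - c|x|^2) y) / (1 + 2c<x,y> + c^2 |x|^2 |y|^2)
    """
    xy, x2, y2 = dot(x, y), dot(x, x), dot(y, y)
    coef_x = 1 + 2 * c * xy + c * y2
    coef_y = 1 - c * x2
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def poincare_distance(x, y, c=1.0):
    """Geodesic distance d_c(x, y) = (2 / sqrt(c)) * artanh(sqrt(c) * |(-x) (+)_c y|)."""
    neg_x = [-a for a in x]
    diff = mobius_add(neg_x, y, c)
    norm = math.sqrt(dot(diff, diff))
    # Clamp just below 1 so artanh stays finite for points near the boundary.
    arg = min(math.sqrt(c) * norm, 1 - 1e-15)
    return 2 / math.sqrt(c) * math.atanh(arg)

# Example: distance from the origin to a point halfway to the boundary (c = 1).
d = poincare_distance([0.0, 0.0], [0.5, 0.0])
```

A hyperbolic attention layer could, under these assumptions, score a query-key pair by a decreasing function of `poincare_distance` instead of a dot product, so that tokens close in the hierarchy receive higher weight. How HVT actually forms its attention scores is specified in the paper body, not in this abstract.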

Paper Structure (46 sections, 27 equations, 1 figure, 5 tables)