Hyperbolic Learning with Multimodal Large Language Models
Paolo Mandica, Luca Franco, Konstantinos Kallidromitis, Suzanne Petryk, Fabio Galasso
TL;DR
This work presents the first large-scale hyperbolic vision-language model built on BLIP-2, showing that hyperbolic embeddings can encode uncertainty via embedding radii while achieving performance on par with Euclidean baselines. A stable training strategy is developed, including cosine-based similarity in hyperbolic space, Random Query Selection, and Random Text Pruning to preserve diversity and uncertainty signals. Empirical results on COCO indicate that hyperbolic BLIP-2 can match Euclidean performance and offer meaningful uncertainty proxies, though using the Poincaré distance directly for contrastive loss can destabilize training. The study highlights both the potential and the practical challenges of scaling hyperbolic representations in multimodal models, and introduces techniques with broad applicability to large VLMs for improved robustness and interpretability.
Abstract
Hyperbolic embeddings have demonstrated their effectiveness in capturing measures of uncertainty and hierarchical relationships across various deep-learning tasks, including image segmentation and active learning. However, their application in modern vision-language models (VLMs) has been limited. A notable exception is MERU, which leverages the hierarchical properties of hyperbolic space in the CLIP ViT-large model, consisting of hundreds of millions of parameters. In our work, we address the challenges of scaling multimodal hyperbolic models by orders of magnitude in terms of parameters (billions) and training complexity using the BLIP-2 architecture. Although hyperbolic embeddings offer potential insights into uncertainty not present in Euclidean embeddings, our analysis reveals that scaling these models is particularly difficult. We propose a novel training strategy for a hyperbolic version of BLIP-2, which allows us to achieve performance comparable to its Euclidean counterpart while maintaining stability throughout the training process and providing a meaningful indication of uncertainty with each embedding.
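To make the quantities mentioned above concrete, the following is a minimal sketch in plain Python of the three ingredients the summary refers to: the embedding radius (the uncertainty proxy), the Poincaré distance, and a cosine-style similarity evaluated on points of the Poincaré ball. It is an illustration under stated assumptions, not the training code of the paper: the function names, the unit curvature, the use of the exponential map at the origin to place features on the ball, and the choice to compute the cosine directly on ball coordinates are all assumptions made here. How the radius maps to certainty is not specified in this section, so the sketch only computes it.

```python
import math


def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(t * t for t in x))


def exp_map_origin(v, c=1.0):
    """Map a Euclidean feature v onto the Poincaré ball of curvature -c
    via the exponential map at the origin (assumed projection step)."""
    n = norm(v)
    if n == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * t for t in v]


def embedding_radius(x):
    """Distance of a ball point from the origin in ball coordinates;
    this is the quantity used as an uncertainty proxy."""
    return norm(x)


def poincare_distance(x, y):
    """Geodesic distance between two points of the unit Poincaré ball (c = 1)."""
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = (1.0 - sum(a * a for a in x)) * (1.0 - sum(b * b for b in y))
    return math.acosh(1.0 + 2.0 * diff2 / denom)


def cosine_similarity(x, y):
    """Angle-based similarity on ball coordinates (assumed form of the
    cosine-based similarity); it ignores the radii of both points."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (norm(x) * norm(y))


# Hypothetical image and text features before projection onto the ball.
image_feat = exp_map_origin([0.3, 0.4])
text_feat = exp_map_origin([0.6, 0.1])

print("radii:", embedding_radius(image_feat), embedding_radius(text_feat))
print("Poincaré distance:", poincare_distance(image_feat, text_feat))
print("cosine similarity:", cosine_similarity(image_feat, text_feat))
```

One property visible in this sketch is that the cosine similarity depends only on direction and is bounded in [-1, 1], whereas the Poincaré distance grows without bound as either point approaches the boundary of the ball. This is consistent with the observation that using the distance directly in a contrastive loss can destabilize training, while an angle-based similarity leaves the radius free to act as a separate signal.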
