Hyperbolic Learning with Multimodal Large Language Models

Paolo Mandica, Luca Franco, Konstantinos Kallidromitis, Suzanne Petryk, Fabio Galasso

TL;DR

This work presents the first large-scale hyperbolic vision-language model built on BLIP-2, showing that hyperbolic embeddings can encode uncertainty via embedding radii while achieving performance on par with Euclidean baselines. A stable training strategy is developed, including cosine-based similarity in hyperbolic space, Random Query Selection, and Random Text Pruning to preserve diversity and uncertainty signals. Empirical results on COCO indicate that hyperbolic BLIP-2 can match Euclidean performance and offer meaningful uncertainty proxies, though using the Poincaré distance directly for contrastive loss can destabilize training. The study highlights both the potential and the practical challenges of scaling hyperbolic representations in multimodal models, and introduces techniques with broad applicability to large VLMs for improved robustness and interpretability.

Abstract

Hyperbolic embeddings have proven effective at capturing measures of uncertainty and hierarchical relationships across various deep-learning tasks, including image segmentation and active learning. However, their application in modern vision-language models (VLMs) has been limited. A notable exception is MERU, which leverages the hierarchical properties of hyperbolic space in the CLIP ViT-large model, consisting of hundreds of millions of parameters. In our work, we address the challenges of scaling multi-modal hyperbolic models by orders of magnitude in terms of parameters (billions) and training complexity using the BLIP-2 architecture. Although hyperbolic embeddings offer potential insights into uncertainty not present in Euclidean embeddings, our analysis reveals that scaling these models is particularly difficult. We propose a novel training strategy for a hyperbolic version of BLIP-2 that achieves performance comparable to its Euclidean counterpart while remaining stable throughout training and providing a meaningful indication of uncertainty with each embedding.

Paper Structure

This paper contains 43 sections, 13 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: (top-row) Images with lowest hyperbolic radius; (bottom-row) Images with highest hyperbolic radius. Radius is indicated above each image.
  • Figure 2: Per class radius distribution, using the annotations in the COCO dataset.
  • Figure 3: Average hyperbolic radius of the image embeddings (left) and text embeddings (right). The radius stays at its maximum level with hidden dimension 256 (cyan), while it varies with a lower dimension, i.e., 16 (purple). Only with Random Text Pruning (green) does the text embedding avoid converging to the edge of the Poincaré ball, resulting in a more variable and meaningful radius.
  • Figure 4: Query selection distribution across the test dataset. The top graph depicts the selection performed by the BLIP-2 model, whereas the bottom graph shows the selection performed by our model.
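
The hyperbolic radius that the figures above report can be illustrated with a small sketch. This is not the authors' implementation: it assumes the standard Poincaré ball model with curvature -c, uses the textbook exponential map at the origin to place a Euclidean feature vector inside the ball, and measures the radius as the geodesic distance from the origin. The function names (`expmap0`, `hyperbolic_radius`, `cosine_similarity`), the curvature value, and the toy feature vectors are all hypothetical choices made for illustration.

```python
import math


def expmap0(v, c=1.0):
    """Map a Euclidean vector v onto the Poincaré ball of curvature -c
    using the exponential map at the origin (standard formula, assumed here)."""
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0 for _ in v]
    scale = math.tanh(sqrt_c * norm) / (sqrt_c * norm)
    return [scale * x for x in v]


def hyperbolic_radius(x, c=1.0, eps=1e-7):
    """Geodesic distance from the origin to a point x on the Poincaré ball."""
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(t * t for t in x))
    # Clamp just inside the ball boundary so atanh stays finite.
    arg = min(sqrt_c * norm, 1.0 - eps)
    return 2.0 / sqrt_c * math.atanh(arg)


def cosine_similarity(x, y):
    """Angle-based similarity; it ignores how far each point is from the origin."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Toy features (made up): same direction, very different magnitudes.
image_point = expmap0([0.3, 0.4])
text_point = expmap0([3.0, 4.0])

print(hyperbolic_radius(image_point))   # small radius, close to the origin
print(hyperbolic_radius(text_point))    # large radius, close to the ball's edge
print(cosine_similarity(image_point, text_point))
```

Under these assumptions the two quantities are decoupled: the exponential map only rescales a vector, so a cosine-based similarity between mapped points depends on direction alone, while the radius is free to vary per embedding. That separation is one plausible reading of why a cosine-based similarity can coexist with a radius used as an uncertainty proxy.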