Training-Free Dual Hyperbolic Adapters for Better Cross-Modal Reasoning

Yi Zhang, Chun-Wun Cheng, Junyi He, Ke Yu, Yushun Tang, Carola-Bibiane Schönlieb, Zhihai He, Angelica I. Aviles-Rivero

TL;DR

The paper tackles domain shift in vision-language models by introducing a training-free adaptation method that leverages hyperbolic geometry. It presents Training-free Dual Hyperbolic Adapters (T-DHA), which embed class concepts in the Poincaré ball and use both positive and negative prototypes for image-image and image-text predictions, fused with a learned residual weight. By utilizing hyperbolic distance and explicit negative learning, T-DHA achieves superior few-shot performance and domain generalization without fine-tuning, validated across 11 datasets and multiple backbones. These results demonstrate the practical potential of geometry-aware, training-free adaptation for robust cross-modal reasoning in real-world settings.

Abstract

Recent research in Vision-Language Models (VLMs) has significantly advanced our capabilities in cross-modal reasoning. However, existing methods suffer from performance degradation with domain changes or require substantial computational resources for fine-tuning in new domains. To address this issue, we develop a new adaptation method for large vision-language models, called Training-free Dual Hyperbolic Adapters (T-DHA). We characterize the vision-language relationship between semantic concepts, which typically has a hierarchical tree structure, in the hyperbolic space instead of the traditional Euclidean space. Hyperbolic spaces exhibit exponential volume growth with radius, unlike the polynomial growth in Euclidean space. We find that this unique property is particularly effective for embedding hierarchical data structures using the Poincaré ball model, achieving significantly improved representation and discrimination power. Coupled with negative learning, it provides more accurate and robust classifications with fewer feature dimensions. Our extensive experimental results on various datasets demonstrate that the T-DHA method significantly outperforms existing state-of-the-art methods in few-shot image recognition and domain generalization tasks.

Paper Structure

This paper contains 11 sections, 16 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Hierarchical Data Representation in Euclidean vs. Hyperbolic Spaces. (A) The example displays a hierarchical categorization of mammals into dogs and cats, further divided by breeds. (B) Euclidean space struggles to effectively capture hierarchical structures. (C) Hyperbolic space naturally accommodates tree-like structures, maintaining clear distinctions between hierarchical levels. This motivates our use of hyperbolic class prototypes within the T-DHA architecture (Section III-B), where embeddings are explicitly mapped into the Poincaré ball model for improved representation.
  • Figure 2: Embedding Images in a Conceptual Hierarchy. Euclidean geometry (a) struggles to model the exponential growth of semantic concepts around a central node (e.g., a Maine Coon cat). In contrast, hyperbolic geometry (b) provides a native structure for such hierarchies, with volume expanding exponentially to incorporate all concepts. The right-hand illustration in (b) is taken from MathWorld, Wolfram Research (mathworld_poincare_disk).
  • Figure 3: Overview of the Proposed Training-free Dual Hyperbolic Adapters (T-DHA). The architecture leverages image-image and image-text predictions. For image-image prediction, visual features are first extracted using the CLIP encoder and then mapped into hyperbolic space via the exponential map (Equation 2). Predictions are computed using Poincaré distance to both positive and negative class prototypes in hyperbolic space. The image-text branch similarly integrates positive and negative predictions via prompt-based cosine similarity. These geometry-aware branches are fused to yield the final prediction. All geometric operations are annotated in the figure and fully described in Section III-B.
  • Figure 4: Classification Performance Comparison on Training-free Few-shot Learning (1-/2-/4-/8-/16-shot) on 11 benchmark datasets. The top-left panel shows the accuracy averaged across all 11 datasets.
  • Figure 5: Visual comparison of feature distributions in Euclidean vs. hyperbolic spaces. Left: CLIP features in Euclidean space (t-SNE projection) show significant overlap between similar classes, making classification challenging. Right: T-DHA features in hyperbolic space (Poincaré disk) achieve clear separation of all classes, demonstrating the advantage of hyperbolic geometry for hierarchical data. Colors represent different classes: Red/Light Red = Dog breeds (Labrador/Poodle), Blue/Light Blue = Cat breeds (Persian/Siamese). The hyperbolic representation better preserves semantic hierarchies, enabling more accurate few-shot classification.
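
The image-image branch described in the Figure 3 caption (exponential map, then Poincaré distance to positive and negative prototypes) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the standard textbook forms of the exponential map at the origin and the Poincaré distance, which may differ in detail from the paper's Equation 2. The function names, the curvature parameter `c`, the `neg_weight` parameter, and the way the positive and negative distances are combined are all assumptions made for this example.

```python
import numpy as np


def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin: send Euclidean (tangent) vectors
    into the Poincare ball of curvature -c (standard form, assumed here)."""
    sqrt_c = np.sqrt(c)
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), eps)
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)


def poincare_distance(x, y, c=1.0, eps=1e-9):
    """Geodesic distance between points x and y inside the Poincare ball."""
    sq_dist = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - c * np.sum(x ** 2, axis=-1)) * (1.0 - c * np.sum(y ** 2, axis=-1))
    arg = 1.0 + 2.0 * c * sq_dist / np.maximum(denom, eps)
    return np.arccosh(arg) / np.sqrt(c)


def dual_scores(feature, pos_protos, neg_protos, neg_weight=1.0):
    """Hypothetical scoring rule: a class scores higher when the image is
    close to its positive prototype and far from its negative prototype.

    feature:    (d,)   Euclidean image feature
    pos_protos: (K, d) one positive prototype per class
    neg_protos: (K, d) one negative prototype per class
    """
    x = expmap0(feature)
    d_pos = poincare_distance(x[None, :], expmap0(pos_protos))
    d_neg = poincare_distance(x[None, :], expmap0(neg_protos))
    return -d_pos + neg_weight * d_neg


if __name__ == "__main__":
    feature = np.array([0.3, 0.0])
    pos = np.array([[0.3, 0.0], [-0.3, 0.0]])
    neg = np.array([[-0.3, 0.0], [0.3, 0.0]])
    scores = dual_scores(feature, pos, neg)
    print("predicted class:", int(np.argmax(scores)))
```

In this toy setup the feature coincides with the positive prototype of class 0 and with the negative prototype of class 1, so class 0 receives the higher score. How T-DHA actually weights the two terms, and how it fuses this branch with the image-text branch, is specified in the paper's Section III-B rather than here.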