Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

Wu Wei, Xiaomeng Fan, Yuwei Wu, Zhi Gao, Pengxiang Li, Yunde Jia, Mehrtash Harandi

TL;DR

The paper tackles asymmetric modality alignment in vision-language models by constructing and aligning tree-like hierarchical features for images and text on heterogeneous hyperbolic manifolds. It introduces a semantic-aware visual feature extraction framework that builds coarse-to-fine visual feature trees and embeds them alongside textual trees in separate hyperbolic spaces, with an optimized intermediate manifold learned via KL-based distances. The authors establish existence and uniqueness for the optimal intermediate manifold and employ entailment-based losses to align inter- and intra-modal hierarchies. Empirical results on taxonomic open-set classification demonstrate consistent improvements in few-shot and cross-domain settings, showing strong generalization and controllable geometric alignment benefits.

Abstract

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
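As a rough illustration of what embedding features into hyperbolic manifolds with distinct curvatures can look like, the sketch below maps a Euclidean feature onto a hyperboloid of curvature $-c$ with the standard exponential map at the origin, and measures geodesic distance there. It assumes the Lorentz (hyperboloid) model, which the notation $\mathcal{L}^{c}$ suggests but this summary does not confirm. The function names and the curvature values are hypothetical and are not taken from the paper.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map_origin(v, c):
    """Map a Euclidean feature v onto the hyperboloid of curvature -c (c > 0).

    Assumption: v is treated as a tangent vector at the hyperboloid origin
    (1/sqrt(c), 0, ..., 0). This is the textbook exponential map, not
    necessarily the paper's exact embedding.
    """
    sqrt_c = math.sqrt(c)
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [1.0 / sqrt_c] + [0.0] * len(v)
    time = math.cosh(sqrt_c * norm) / sqrt_c
    scale = math.sinh(sqrt_c * norm) / (sqrt_c * norm)
    return [time] + [scale * a for a in v]

def lorentz_distance(x, y, c):
    """Geodesic distance between two points on the same hyperboloid."""
    # Clamp to 1.0 to guard against rounding just below the acosh domain.
    arg = max(1.0, -c * lorentz_inner(x, y))
    return math.acosh(arg) / math.sqrt(c)

# Hypothetical curvatures for the two modalities (illustrative values only).
c_image, c_text = 0.5, 1.0
feature = [0.3, 0.4]
x_image = exp_map_origin(feature, c_image)
x_text = exp_map_origin(feature, c_text)
```

The same feature lands on different hyperboloids depending on the curvature, so the two embeddings cannot be compared with a single distance function. This is the mismatch that the intermediate manifold in the abstract is meant to bridge.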

Paper Structure

This paper contains 39 sections, 3 theorems, 50 equations, 7 figures, 6 tables, 3 algorithms.

Key Result

Theorem 1

Given two manifolds $\mathcal{L}^{c_{1}}$ and $\mathcal{L}^{c_{3}}$ with a distribution defined on each, the distance between $\mathcal{L}^{c_1}$ and $\mathcal{L}^{c_3}$ is defined as an affine transformation of the Kullback-Leibler (KL) divergence between these distributions. The resulting expression involves a constant $r$ that depends on $u_1$ and $u_2$. (The displayed formulas for the distributions and the distance are not reproduced in this summary.)

Figures (7)

  • Figure 1: Comparison between previous methods and our method. Previous methods extract a single visual feature to align with hierarchical textual features in Euclidean spaces. This asymmetric alignment leads to inferior prediction. In contrast, our method achieves a symmetrical alignment by extracting hierarchical visual features on hyperbolic manifolds, leading to optimal prediction.
  • Figure 2: Pipeline of our method.
  • Figure 3: Structure of semantic-aware visual feature extraction framework. A cross-attention module is employed to generate semantic-aware visual features $\boldsymbol{v}_i$ at the same semantic level as $\boldsymbol{t}_i$.
  • Figure 4: Illustration of entailment.
  • Figure 5: t-SNE visualization of learned image representations, colored by taxonomic labels. The baseline ProTeCt (B) is shown in the first row, while our method (O) is shown in the second row. Our method demonstrates improved feature separability across taxonomic categories.
  • ...and 2 more figures

Theorems & Definitions (3)

  • Theorem 1
  • Proposition 1
  • Proposition 2