PHyCLIP: $\ell_1$-Product of Hyperbolic Factors Unifies Hierarchy and Compositionality in Vision-Language Representation Learning

Daiki Yoshikawa, Takashi Matsubara

TL;DR

PHyCLIP addresses the challenge of encoding both taxonomic hierarchies and cross-family compositionality in vision–language representations. It introduces an $\ell_1$-product metric space of hyperbolic factors, embedding images and texts as tuples in $(\mathbb{H}^d)^k$ to separate intra-family taxonomy from cross-family conjunctions. The method combines hyperbolic entailment cones with a contrastive objective, achieving strong results on zero-shot classification, retrieval, hierarchical classification, and compositional understanding, while offering interpretable, factor-level structure. This dual-structure embedding provides a principled and scalable approach to multi-modal semantics with practical benefits for retrieval, recognition, and structured understanding in real-world data.

Abstract

Vision-language models have achieved remarkable success in multi-modal representation learning from large-scale pairs of visual scenes and linguistic descriptions. However, they still struggle to simultaneously express two distinct types of semantic structures: the hierarchy within a concept family (e.g., dog $\preceq$ mammal $\preceq$ animal) and the compositionality across different concept families (e.g., "a dog in a car" $\preceq$ dog, car). Recent works have addressed this challenge by employing hyperbolic space, which efficiently captures tree-like hierarchy, yet its suitability for representing compositionality remains unclear. To resolve this dilemma, we propose PHyCLIP, which employs an $\ell_1$-Product metric on a Cartesian product of Hyperbolic factors. With our design, intra-family hierarchies emerge within individual hyperbolic factors, and cross-family composition is captured by the $\ell_1$-product metric, analogous to a Boolean algebra. Experiments on zero-shot classification, retrieval, hierarchical classification, and compositional understanding tasks demonstrate that PHyCLIP outperforms existing single-space approaches and offers more interpretable structures in the embedding space.
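The distance described in the abstract can be illustrated with a minimal sketch. It assumes the Poincaré ball model with curvature $-1$ (the abstract does not say which model of hyperbolic space is used), and the function names and toy coordinates are hypothetical. Each embedding is a tuple of $k$ points, one per hyperbolic factor, and the $\ell_1$-product distance is the sum of the factor-wise hyperbolic distances.

```python
import math

def poincare_distance(x, y):
    """Geodesic distance between two points of the open unit ball (curvature -1).

    Assumption: Poincare ball model; the source does not fix a model here.
    """
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_x) * (1.0 - sq_y)))

def l1_product_distance(X, Y):
    """Sum of factor-wise hyperbolic distances between two k-tuples of points."""
    return sum(poincare_distance(x, y) for x, y in zip(X, Y))

# Hypothetical toy embeddings with k = 2 factors of dimension d = 2.
# A concept that does not involve a factor stays at that factor's origin.
origin = (0.0, 0.0)
dog = [(0.5, 0.0), origin]
car = [origin, (0.0, 0.5)]
dog_in_car = [(0.5, 0.0), (0.0, 0.5)]

print(l1_product_distance(dog, car))         # 2 * ln(3), about 2.197
print(l1_product_distance(dog, dog_in_car))  # ln(3), about 1.099
```

In this toy configuration the composed point lies on an $\ell_1$ geodesic between the two single-concept points, so the distances add up factor by factor. This is the additive behavior that a single hyperbolic space does not provide in general.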

Paper Structure

This paper contains 47 sections, 6 theorems, 18 equations, 7 figures, 5 tables.

Key Result

Theorem 1

Let $\mathbb{H}^d$ be a $d$-dimensional hyperbolic space with the hyperbolic distance $d_{\mathbb{H}^d}$. For every finite metric tree $T$ (and every infinite metric tree $T$ with known bounds for maximum degree and minimum edge length), and for every $\varepsilon>0$, there exist a scale $\tau>0$ an

Figures (7)

  • Figure 1: Conceptual diagram of hierarchical and compositional structures. While all arrows represent entailments ($\preceq$), they differ in nature. (upper) Linguistic concepts organize tree-like taxonomic hierarchies of concept families, each of which can be embedded into a hyperbolic space [Nickel2017]. (middle) Images and texts exhibit compositionality across distinct concept families, which can be captured by a Boolean algebra or an $\ell_1$-product metric. (lower) Images are instances of their corresponding captions.
  • Figure 2: Overview of PHyCLIP. Images and texts are encoded as points $\bm{X}$ in an $\ell_1$-product metric space of hyperbolic factors, $(\mathbb{H}^d)^k$, that is, as tuples of points $\bm{x}^{(i)}$ in hyperbolic spaces $\mathbb{H}^d_i$, where their distance is defined by the sum of hyperbolic distances. The entailment relations $\bm{X}\preceq \bm{Y}$ are encoded using entailment cones as $\bm{x}^{(i)}\in C(\bm{y}^{(i)})$ within hyperbolic factors $\mathbb{H}^d_i$.
  • Figure 3: Norm distributions. In (b) and (c), image norms are consistently larger than text norms, because images are more specific than their paired texts ($I_b \preceq T_b$). However, in a single hyperbolic factor shown in (a), image and text norms largely overlap, as PHyCLIP may keep some factors unused for instances that do not contain the corresponding concept families.
  • Figure 4: Visualization of factor-wise embeddings. (a) Each concept (e.g., dog or car) activates a distinct factor (i.e., $i=39$ or $i=9$), and their composition (e.g., "a dog and a car") activates the corresponding factors simultaneously. (b) A set of relevant concepts (e.g., hyponyms of mammals) forms a hierarchical structure in the corresponding factor (e.g., $i=39$), while they cluster near the origin in another factor (e.g., $i=9$).
  • Figure 5: Embeddings projected onto 2D disks by HoroPCA. A set of relevant concepts (hyponyms of mammals or words related to vehicles and everyday-carry items) forms a hierarchical structure in the corresponding factor ($i=39$ or $i=9$), while the same concepts cluster near the origin in another factor ($i=9$ or $i=39$).
  • ...and 2 more figures
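The factor-wise entailment test from the caption of Figure 2, $\bm{x}^{(i)}\in C(\bm{y}^{(i)})$, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses a Poincaré-ball formulation of hyperbolic entailment cones, the aperture constant `K` is a hypothetical hyperparameter, and all function names and coordinates are made up for the example. A point $\bm{x}$ lies in the cone rooted at $\bm{y}$ when the exterior angle at $\bm{y}$ does not exceed the cone's half-aperture, and $\bm{X}\preceq\bm{Y}$ is taken to hold when this is true in every factor.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def half_aperture(y, K=0.1):
    """Half-aperture of the cone rooted at y (y must be nonzero).

    The aperture shrinks as y moves toward the boundary. The argument of
    asin is clamped at 1, a simplification for roots very close to the origin.
    """
    norm_y = math.sqrt(_dot(y, y))
    return math.asin(min(1.0, K * (1.0 - norm_y ** 2) / norm_y))

def exterior_angle(y, x):
    """Angle at y between the outward radial direction and the geodesic to x."""
    xy, yy, xx = _dot(x, y), _dot(y, y), _dot(x, x)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    if diff == 0.0:
        return 0.0  # a point is treated as lying in its own cone
    num = xy * (1.0 + yy) - yy * (1.0 + xx)
    den = math.sqrt(yy) * diff * math.sqrt(1.0 + xx * yy - 2.0 * xy)
    return math.acos(max(-1.0, min(1.0, num / den)))

def in_cone(x, y, K=0.1):
    """True if x lies in the entailment cone rooted at y."""
    return exterior_angle(y, x) <= half_aperture(y, K)

def entails(X, Y, K=0.1):
    """X is entailed by Y if every factor of X lies in the cone of Y's factor."""
    return all(in_cone(x, y, K) for x, y in zip(X, Y))

# Hypothetical embeddings with k = 2 factors: the more specific item sits
# farther from the origin along the same directions as the general one.
general = [(0.5, 0.0), (0.0, 0.5)]
specific = [(0.8, 0.0), (0.0, 0.8)]

print(entails(specific, general))  # True
print(entails(general, specific))  # False
```

The sketch requires every cone root to be away from the origin. How factors that stay near the origin are handled is not shown in this excerpt, so the example avoids that case.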

Theorems & Definitions (10)

  • Theorem 1: Hyperbolic embedding of trees [Sarkar2011]
  • Definition 1: $\ell_1$-product metric space
  • Proposition 1: Embedding of Boolean Lattice
  • Theorem 2: Embedding into an $\ell_1$-product metric space of hyperbolic factors
  • Definition 2: Quasi-isometric embedding [Bridson1999]
  • Proposition 2: $\ell_1$-product of trees is not hyperbolic
  • Lemma 1: Stability of geodesic triangles under quasi-isometric embeddings
  • proof
  • Lemma 2: Product of quasi-isometric embeddings
  • proof: Proof of Lemma 2
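The link between a Boolean lattice and an $\ell_1$ metric, which the title of Proposition 1 points to, can be checked on a small example. Only the proposition's title appears in this listing, so the sketch below illustrates a standard fact rather than the proposition's exact statement: subsets encoded as 0/1 indicator vectors have $\ell_1$ distance equal to the size of their symmetric difference, and containment $A \subseteq B$ holds exactly when $A$ lies on an $\ell_1$ geodesic from the empty set to $B$. The universe of concepts and all function names are hypothetical.

```python
from itertools import combinations

def indicator(subset, universe):
    """Encode a subset as a 0/1 vector with one coordinate per concept."""
    return tuple(1 if item in subset else 0 for item in universe)

def l1(u, v):
    """l1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical universe of three concept families.
universe = ("dog", "car", "tree")
subsets = [
    frozenset(c)
    for r in range(len(universe) + 1)
    for c in combinations(universe, r)
]

def is_subset_by_metric(a, b):
    """Recover a <= b from distances alone: a is 'between' the empty set and b."""
    empty = indicator(frozenset(), universe)
    va, vb = indicator(a, universe), indicator(b, universe)
    return l1(empty, vb) == l1(empty, va) + l1(va, vb)

# The metric test agrees with set containment on all 8 x 8 pairs.
print(all(is_subset_by_metric(a, b) == (a <= b) for a in subsets for b in subsets))
```

Replacing each 0/1 coordinate with a point in a hyperbolic factor gives the kind of space described in Definition 1, where the per-coordinate term is a hyperbolic distance instead of an absolute difference.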