ARGENT: Adaptive Hierarchical Image-Text Representations

Chuong Huynh, Hossein Souri, Abhinav Kumar, Vitali Petsiuk, Deen Dayal Mohan, Suren Kumar

Abstract

Large-scale Vision-Language Models (VLMs) such as CLIP learn powerful semantic representations but operate in Euclidean space, which fails to capture the inherent hierarchical structure of visual and linguistic concepts. Hyperbolic geometry, with its exponential volume growth, offers a principled alternative for embedding such hierarchies with low distortion. However, existing hyperbolic VLMs use entailment losses that are unstable: as parent embeddings contract toward the origin, their entailment cones widen toward a half-space, causing catastrophic cone collapse that destroys the intended hierarchy. Additionally, hierarchical evaluation of these models remains unreliable: it relies largely on retrieval-based and correlation-based metrics, which are prone to taxonomy dependence and ambiguous negatives. To address these limitations, we propose an adaptive entailment loss paired with a norm regularizer that prevents cone collapse without heuristic aperture clipping. We further introduce an angle-based probabilistic entailment protocol (PEP) for evaluating hierarchical understanding, scored with AUC-ROC and Average Precision. Together, these yield ARGENT (Adaptive hieRarchical imaGe-tExt represeNTation), a stronger hyperbolic VLM baseline. ARGENT improves on the SOTA hyperbolic VLM by 0.7, 1.1, and 0.8 absolute points on image classification, text-to-image retrieval, and the proposed hierarchical metrics, respectively.
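The abstract says PEP is scored with AUC-ROC and Average Precision. As a rough illustration of what those two scores measure, here is a minimal sketch that takes a list of entailment scores and binary labels (1 = true parent-child pair, 0 = negative pair). The function names and the toy scores are hypothetical, and this section does not specify how the paper computes $p_{\text{Ent}}$ from angles, so the scores are simply assumed to be given.

```python
import numpy as np


def auc_roc(scores, labels):
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("need at least one positive and one negative pair")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


def average_precision(scores, labels):
    """Average Precision: mean of precision@k over the ranks k of the positives.

    Assumption: tied scores are broken by input order (stable sort); a
    threshold-based implementation may differ slightly when ties exist.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    if not labels.any():
        raise ValueError("need at least one positive pair")
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision_at_k[hits].mean())


# Toy example with hypothetical entailment scores.
toy_scores = [0.9, 0.8, 0.7, 0.4]
toy_labels = [1, 0, 1, 0]
toy_auc = auc_roc(toy_scores, toy_labels)
toy_ap = average_precision(toy_scores, toy_labels)
```

Both scores depend only on how positives rank against negatives, not on the absolute scale of the scores, which is why they can separate models that look identical under a correlation-based metric.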

Paper Structure (15 sections, 11 equations, 8 figures, 4 tables, 1 algorithm)

Figures (8)

  • Figure 1: ARGENT improves both hierarchical training and evaluation. (a) ARGENT outperforms the HyCoCLIP baseline on downstream tasks and our hierarchical benchmark (PEP AUC). (b) Our new Probabilistic Entailment Score ($p_{\text{Ent}}$) offers a more discriminative evaluation; while both models may achieve 100% correlation, our metric correctly identifies $VLM_A$ as superior.
  • Figure 2: Behavior of the Adaptive Entailment loss $\mathcal{L}_\text{AdaEnt}$ and the standard Entailment loss $\mathcal{L}_\text{Ent}$. The figure highlights two cases in a top-down view of hyperbolic space. (1) Inside the norm boundary ($\|\tilde{\mathbf{y}}\| \le \frac{2C}{\sqrt{\kappa}}$): the standard $\mathcal{L}_\text{Ent}$ collapses to zero for all $\mathbf{x}$ in the non-origin half-space, even when the exterior angle $\phi(\mathbf{x},\mathbf{y})$ is large. Our $\mathcal{L}_\text{AdaEnt}$ remains active ($\mathcal{L}_\text{AdaEnt} \gg 0$), preventing vanishing gradients. (2) Outside the norm boundary: $\mathcal{L}_\text{Ent}$ penalizes the likely noisy positive $\mathbf{x}_2$ and the true negative $\mathbf{x}_1$ with the same value. Our $\mathcal{L}_\text{AdaEnt}$ adaptively assigns a lower loss to $\mathbf{x}_2$ while strongly penalizing $\mathbf{x}_1$.
  • Figure 3: Analysis of the vanilla and proposed adaptive entailment loss. (a) The norm constraint required by the standard loss. (b) A comparison between loss functions. (c) The effect of our adaptive weight. (d) The behavior of our norm regularizer.
  • Figure 4: Weighting factor $h(\mathbf{x}, \mathbf{y})$ that scales the distance (the negative of similarity) for intra- and inter-modality samples in MERU, HyCoCLIP, and ARGENT. ARGENT uses adaptive weights, whereas MERU and HyCoCLIP use constant weights.
  • Figure 5: Recall@1 of the BLIP-L model on the image-to-text retrieval task on HierarCaps. Levels 1 and 4 denote the most generic and the most specific captions, respectively. We restrict the candidate pool to a local set containing the ground-truth captions and the top-5 most similar (presumed false negative) captions. Model performance degrades as the text becomes more generic (Level 1).
  • ...and 3 more figures
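
The cone-collapse condition in the Figure 2 caption can be made concrete with a small numeric sketch. It assumes the half-aperture form common in prior hyperbolic entailment-cone work, $\omega(\mathbf{y}) = \arcsin\!\big(\frac{2C}{\sqrt{\kappa}\,\|\tilde{\mathbf{y}}\|}\big)$, which is consistent with the norm boundary $\|\tilde{\mathbf{y}}\| \le \frac{2C}{\sqrt{\kappa}}$ stated in the caption but is not given in this section. The function names and the default values `kappa=1.0` and `C=0.1` are illustrative assumptions, not the paper's settings.

```python
import math


def norm_boundary(kappa: float = 1.0, C: float = 0.1) -> float:
    """Norm below which the cone degenerates: 2C / sqrt(kappa) (from the Figure 2 caption)."""
    return 2.0 * C / math.sqrt(kappa)


def half_aperture(space_norm: float, kappa: float = 1.0, C: float = 0.1) -> float:
    """Half-aperture in radians of the entailment cone at a parent embedding.

    Assumed form: arcsin(2C / (sqrt(kappa) * ||y||)). The min() below only
    keeps arcsin defined inside the norm boundary; it is not a description
    of how any particular model handles that region.
    """
    if space_norm <= 0.0:
        raise ValueError("space_norm must be positive")
    ratio = 2.0 * C / (math.sqrt(kappa) * space_norm)
    return math.asin(min(1.0, ratio))


# As the parent norm shrinks toward the boundary, the cone widens to a half-space (pi/2).
apertures = {norm: half_aperture(norm) for norm in (0.1, 0.2, 0.4, 0.8)}
```

Under these assumptions, any parent with norm at or below `norm_boundary()` gets a half-aperture of $\pi/2$. Every point in the non-origin half-space then lies inside the cone, which is the regime where the caption says the standard $\mathcal{L}_\text{Ent}$ collapses to zero.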