Emergent Visual-Semantic Hierarchies in Image-Text Representations

Morris Alper, Hadar Averbuch-Elor

TL;DR

This work investigates whether pretrained vision–language models implicitly encode visual–semantic hierarchies and how to reveal and enhance this structure. It introduces Radial Embeddings (RE), a geometry-aligned probing and fine-tuning framework that uses an entailment root defined by the empty-string embedding and the exterior-angle measure $\Xi_{\mathbf{r}}(\cdot,\cdot)$ to capture hierarchical relationships. The authors pair RE with HierarCaps, a large four-tier caption hierarchy dataset (73K train, 1K test) to benchmark multimodal hierarchical understanding, and show zero-shot hierarchical knowledge in models like CLIP, with further gains from a text-only fine-tuning phase that preserves pretrained knowledge. Across analyses on HierarCaps and external lexical/hierarchical benchmarks, RE-based fine-tuning improves hierarchical metrics while largely leaving standard cross-modal retrieval performance intact, highlighting practical gains in hierarchical reasoning without retraining from scratch.

Abstract

While recent vision-and-language models (VLMs) like CLIP are a powerful tool for analyzing text and images in a shared semantic space, they do not explicitly model the hierarchical nature of the set of texts which may describe an image. Conversely, existing multimodal hierarchical representation learning methods require costly training from scratch, failing to leverage the knowledge encoded by state-of-the-art multimodal foundation models. In this work, we study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies despite not being directly trained for this purpose. We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding, and contribute the HierarCaps dataset, a benchmark facilitating the study of hierarchical knowledge in image--text representations, constructed automatically via large language models. Our results show that foundation VLMs exhibit zero-shot hierarchical understanding, surpassing the performance of prior models explicitly designed for this purpose. Furthermore, we show that foundation models may be better aligned to hierarchical reasoning via a text-only fine-tuning phase, while retaining pretraining knowledge.

Paper Structure (36 sections, 5 equations, 8 figures, 7 tables)


Figures (8)

  • Figure 1: A single image may be described by many texts of varying levels of descriptiveness. While SOTA multimodal foundation models are commonly used to retrieve a single text matching an image, we show that they have learned to model hierarchies. By applying our RE framework to foundation models, we may perform hierarchical image-text matching to place images and captions in the context of a visual-semantic hierarchy which encompasses the relative meanings of all possible images and texts. Above, we show a slice of the visual-semantic hierarchy obtained with our method, with $\emptyset$ indicating the root node in the hierarchy and arrows corresponding to the logical entailment relation between general and more specific descriptions.
  • Figure 2: Illustration of EC and RE optimization. Above, we show cases of positive loss under both frameworks. In the EC framework (left), any embedding in the cone $C_{\theta_\mathbf{r}(\mathbf{e})}(\mathbf{e})$ represents an item entailed by $\mathbf{e}$ (while $\mathbf{e}'$ is outside this cone). The half-aperture angle $\theta_\mathbf{r}(\mathbf{e})$ varies with distance from the root embedding $\mathbf{r}$ to enforce a partial order. During training, the deviation from this cone defines a margin loss. In our proposed RE framework (right), the loss is instead given by the difference between the exterior angles of positive and negative examples, with no dependency on $\theta_\mathbf{r}(\mathbf{e})$. In the above case, this loss is positive since the positive item $\mathbf{e}'$ has a larger exterior angle than the negative item $\mathbf{e}''$.
  • Figure 3: Sample item from the HierarCaps train set. Ground-truth captions have a four-tiered hierarchical structure. The first tier contains the most generic description matching the image (animal), the last contains the most specific description (a goat eating leaves...), and each (positive) tier is logically entailed by the following tier. The train set also contains corresponding negative captions; corresponding captions in the same tier logically contradict each other, and each positive caption is implied by both (positive and negative) captions in the following tier.
  • Figure 4: Dataset construction pipeline, used to create HierarCaps. Real image captions from Conceptual Captions are fed to an LLM with various prompts to produce caption hierarchies; above, * indicates the original caption, enriched with a full hierarchy of shorter, logically entailed captions. These are filtered with an NLI model to enforce logical entailment and augmented with few-shot LLM text completion, producing a set of seed hierarchies. We then distill these hierarchies into a much smaller language model $\mathcal{H}$ which learns to complete hierarchies in both directions; as described in Section \ref{sec:dataset_construction}, $\mathcal{H}$ is used to produce the final hierarchies in HierarCaps.
  • Figure 5: Qualitative hierarchical text-image matching results on HierarCaps (test set). The left column of each table shows hierarchical text-image matching applied using pretrained CLIP-Large, while the right column shows results on the same images after alignment (fine-tuning). The results above are abridged; for full results, see the supplementary material.
  • ...and 3 more figures
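
The Figure 2 caption describes the RE loss as the difference between the exterior angles of a positive and a negative example. The sketch below illustrates that idea in plain Python. It rests on assumptions not stated in this summary: the exterior angle $\Xi_{\mathbf{r}}(\mathbf{e}, \mathbf{e}')$ is taken to be the angle at $\mathbf{e}$ between the ray from $\mathbf{r}$ through $\mathbf{e}$ (extended past $\mathbf{e}$) and the segment from $\mathbf{e}$ to $\mathbf{e}'$, and the loss is the raw difference with no margin or batching. The function names `exterior_angle` and `re_loss` are hypothetical, and the exact formulation in the paper may differ.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def _sub(a: Vector, b: Vector) -> list:
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def _dot(a: Vector, b: Vector) -> float:
    """Euclidean inner product."""
    return sum(x * y for x, y in zip(a, b))


def exterior_angle(root: Vector, e: Vector, other: Vector) -> float:
    """Assumed exterior angle at e: the angle between (e - root) and (other - e).

    It is 0 when `other` lies directly outward from `e` along the ray from
    the root, and grows as `other` deviates from that direction.
    """
    u = _sub(e, root)
    v = _sub(other, e)
    denom = math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v))
    if denom == 0.0:
        raise ValueError("angle is undefined when e equals root or other")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos = max(-1.0, min(1.0, _dot(u, v) / denom))
    return math.acos(cos)


def re_loss(root: Vector, anchor: Vector, positive: Vector, negative: Vector) -> float:
    """Assumed RE-style loss: exterior angle of the positive minus that of the negative.

    The value is positive when the positive item has the larger exterior
    angle, which is the situation drawn in the right panel of Figure 2.
    """
    return exterior_angle(root, anchor, positive) - exterior_angle(root, anchor, negative)


if __name__ == "__main__":
    # Toy 2-D points standing in for embeddings. In the paper's setting the
    # root would be the embedding of the empty string.
    root = [0.0, 0.0]
    anchor = [1.0, 0.0]
    aligned = [2.0, 0.0]      # directly outward from the anchor
    off_axis = [1.0, 1.0]     # perpendicular to the outward direction
    print(re_loss(root, anchor, aligned, off_axis))   # well-ordered triplet
    print(re_loss(root, anchor, off_axis, aligned))   # mis-ordered triplet
```

In this toy setup the first triplet gives a negative value (the positive item is better aligned with the outward direction than the negative one), and the second gives a positive value, matching the "positive loss" case that the caption describes.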