Emergent Visual-Semantic Hierarchies in Image-Text Representations
Morris Alper, Hadar Averbuch-Elor
TL;DR
This work investigates whether pretrained vision–language models implicitly encode visual–semantic hierarchies and how to reveal and enhance this structure. It introduces Radial Embeddings (RE), a geometry-aligned probing and fine-tuning framework that uses an entailment root defined by the empty-string embedding and the exterior-angle measure $\Xi_{\mathbf{r}}(\cdot,\cdot)$ to capture hierarchical relationships. The authors pair RE with HierarCaps, a large four-tier caption hierarchy dataset (73K train, 1K test), to benchmark multimodal hierarchical understanding. They show that models like CLIP have zero-shot hierarchical knowledge, with further gains from a text-only fine-tuning phase that preserves pretrained knowledge. Across analyses on HierarCaps and external lexical/hierarchical benchmarks, RE-based fine-tuning improves hierarchical metrics while leaving standard cross-modal retrieval performance largely intact, yielding gains in hierarchical reasoning without retraining from scratch.
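To make the exterior-angle measure concrete, the sketch below computes one plausible reading of $\Xi_{\mathbf{r}}(\mathbf{e},\mathbf{e}')$: the angle at $\mathbf{e}$ between the ray from the root $\mathbf{r}$ through $\mathbf{e}$ and the segment from $\mathbf{e}$ to $\mathbf{e}'$. This exact definition is an assumption, as the summary above does not spell it out. The function name `exterior_angle` and the 2D vectors are hypothetical toy stand-ins; in practice the root would be the model's empty-string embedding and the points would be text or image embeddings.

```python
import numpy as np

def exterior_angle(root, e, e_prime, eps=1e-12):
    """Angle in radians at e between (e - root) and (e_prime - e).

    Assumed reading of the exterior-angle measure; not taken
    verbatim from the paper.
    """
    u = e - root        # direction from the root out to e
    v = e_prime - e     # direction from e to the other point
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps)
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

# Toy 2D vectors (hypothetical, not real CLIP embeddings).
root = np.array([0.0, 0.0])      # stand-in for the empty-string embedding
anchor = np.array([1.0, 0.0])
aligned = np.array([2.0, 0.0])   # further out along the same ray from root
off_axis = np.array([1.0, 1.0])  # displaced perpendicular to that ray

angle_aligned = exterior_angle(root, anchor, aligned)
angle_off = exterior_angle(root, anchor, off_axis)
print(angle_aligned, angle_off)
```

Under this reading, a point lying further along the same ray from the root gives an angle near zero, while a point displaced sideways gives a larger angle, which is what lets a single root-relative angle serve as a hierarchy score.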
Abstract
While recent vision-and-language models (VLMs) like CLIP are a powerful tool for analyzing text and images in a shared semantic space, they do not explicitly model the hierarchical nature of the set of texts which may describe an image. Conversely, existing multimodal hierarchical representation learning methods require costly training from scratch, failing to leverage the knowledge encoded by state-of-the-art multimodal foundation models. In this work, we study the knowledge of existing foundation models, finding that they exhibit emergent understanding of visual-semantic hierarchies despite not being directly trained for this purpose. We propose the Radial Embedding (RE) framework for probing and optimizing hierarchical understanding, and contribute the HierarCaps dataset, a benchmark facilitating the study of hierarchical knowledge in image-text representations, constructed automatically via large language models. Our results show that foundation VLMs exhibit zero-shot hierarchical understanding, surpassing the performance of prior models explicitly designed for this purpose. Furthermore, we show that foundation models may be better aligned to hierarchical reasoning via a text-only fine-tuning phase, while retaining pretraining knowledge.
