
Explaining, Verifying, and Aligning Semantic Hierarchies in Vision-Language Model Embeddings

Gesina Schwalbe, Mert Keser, Moritz Bayerkuhnlein, Edgar Heinert, Annika Mütze, Marvin Keller, Sparsh Tiwari, Georgii Mikriukov, Diedrich Wolter, Jae Hee Lee, Matthias Rottmann

Abstract

Vision-language model (VLM) encoders such as CLIP enable strong retrieval and zero-shot classification in a shared image-text embedding space, yet the semantic organization of this space is rarely inspected. We present a post-hoc framework to explain, verify, and align the semantic hierarchies induced by a VLM over a given set of child classes. First, we extract a binary hierarchy by agglomerative clustering of class centroids and name internal nodes by dictionary-based matching to a concept bank. Second, we quantify plausibility by comparing the extracted tree against human ontologies using efficient tree- and edge-level consistency measures, and we evaluate utility via explainable hierarchical tree-traversal inference with uncertainty-aware early stopping (UAES). Third, we propose an ontology-guided post-hoc alignment method that learns a lightweight embedding-space transformation, using UMAP to generate target neighborhoods from a desired hierarchy. Across 13 pretrained VLMs and 4 image datasets, our method finds systematic modality differences: image encoders are more discriminative, while text encoders induce hierarchies that better match human taxonomies. Overall, the results reveal a persistent trade-off between zero-shot accuracy and ontological plausibility and suggest practical routes to improve semantic alignment in shared embedding spaces.
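The first step of the framework (extracting a binary hierarchy by agglomerative clustering of class centroids) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class names and random centroids are toy stand-ins for VLM embeddings, and average linkage on cosine distance is one plausible choice of clustering criterion (the paper's exact linkage and metric may differ).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

# Toy stand-in for VLM class-centroid embeddings: one L2-normalized
# centroid per child class (names and vectors are illustrative only).
rng = np.random.default_rng(0)
classes = ["cat", "dog", "car", "truck"]
centroids = rng.normal(size=(len(classes), 8))
centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

# Agglomerative clustering of the centroids yields a binary merge tree;
# each internal node is a candidate "parent" concept to be named later
# against a concept bank.
Z = linkage(centroids, method="average", metric="cosine")
root = to_tree(Z)

def leaves(node):
    """Collect the class names under a node of the binary hierarchy."""
    if node.is_leaf():
        return [classes[node.id]]
    return leaves(node.left) + leaves(node.right)

print(leaves(root))  # all child classes sit under the root
```

In this sketch, naming the internal nodes (the paper's dictionary-based matching step) would operate on each node's set of leaves, e.g. mapping `{"cat", "dog"}` to a parent concept drawn from the concept bank.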

Paper Structure

This paper contains 71 sections, 7 equations, 15 figures, 15 tables.

Figures (15)

  • Figure 1: Our post-hoc framework to explain (embed given concepts, cluster, and name found parents), verify, and align the semantic hierarchy induced by a VLM in its embedding space.
  • Figure 2: Running toy example on CIFAR-10 [alex2009learning] leaf classes.
  • Figure 3: Faithfulness (top) versus plausibility (bottom) results for different models and backbones, averaged over 3 datasets and 3 leaf encoding types (image, text, both).
  • Figure 4: Results of verifying and comparing encoders against a target ontology (fig:faithfulness-by-leaf-mode) and against each other (fig:uted).
  • Figure 5: Comparing text embedding strategies: Top-1 zero-shot accuracy per text embedding strategy on CIFAR-10, CIFAR-100, and ImageNet, for different leaf embedding strategies (averaged). Top: averaged over models; bottom: per model.
  • ...and 10 more figures

Theorems & Definitions (7)

  • Definition 1: Ontology
  • Example 1
  • Example 2: Parent Relations
  • Example 3
  • Example 4: UAES
  • Definition 2
  • Example 5: Local Hierarchical Consistency