Unveiling Ontological Commitment in Multi-Modal Foundation Models

Mert Keser; Gesina Schwalbe; Niki Amini-Naieni; Matthias Rottmann; Alois Knoll

TL;DR

An initial evaluation study shows that meaningful ontological class hierarchies can be extracted from state-of-the-art foundation models and demonstrates how to validate and verify a DNN's learned representations against given ontologies.

Abstract

Ontological commitment, i.e., the concepts, relations, and assumptions used, is a cornerstone of qualitative reasoning (QR) models. The state of the art for processing raw inputs, though, is deep neural networks (DNNs), nowadays often built on multimodal foundation models. These automatically learn rich representations of concepts and the respective reasoning. Unfortunately, the learned qualitative knowledge is opaque, preventing easy inspection, validation, or adaptation against available QR models. So far, it is possible to associate pre-defined concepts with latent representations of DNNs, but extractable relations are mostly limited to semantic similarity. As a next step towards QR for validation and verification of DNNs, we propose a method that extracts the learned superclass hierarchy from a multimodal DNN for a given set of leaf concepts. Under the hood we (1) obtain leaf concept embeddings using the DNN's textual input modality; (2) apply hierarchical clustering to them, exploiting the fact that DNNs encode semantic similarities via vector distances; and (3) label the resulting parent concepts by searching available ontologies from QR. An initial evaluation study shows that meaningful ontological class hierarchies can be extracted from state-of-the-art foundation models. Furthermore, we demonstrate how to validate and verify a DNN's learned representations against given ontologies. Lastly, we discuss potential future applications in the context of QR.
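
The three steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are hypothetical stand-ins for text embeddings from a multimodal DNN, the concept bank is a toy dictionary rather than a real ontology, and the choices of cosine similarity, centroid linkage, and mean-of-leaves parent embeddings are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def extract_hierarchy(leaves, concept_bank):
    """Agglomerative clustering of leaf embeddings with parent labeling.

    leaves, concept_bank: dict mapping a concept name to its embedding.
    Returns a nested tuple (parent_label, left_subtree, right_subtree).
    """
    # Step 1: each leaf starts as its own cluster:
    # (centroid, member leaf embeddings, subtree)
    clusters = [(vec, [vec], name) for name, vec in leaves.items()]
    while len(clusters) > 1:
        # Step 2: merge the two clusters with the most similar centroids
        # (centroid linkage is an assumption, not taken from the paper).
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: cosine(clusters[p[0]][0],
                                               clusters[p[1]][0]))
        a, b = clusters[i], clusters[j]
        members = a[1] + b[1]
        centroid = mean(members)
        # Step 3: label the parent with the closest concept-bank entry.
        label = max(concept_bank,
                    key=lambda c: cosine(concept_bank[c], centroid))
        merged = (centroid, members, (label, a[2], b[2]))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][2]

# Hypothetical toy embeddings (a real setup would query a text encoder).
leaves = {"cat": [0.9, 0.1, 0.0], "dog": [0.8, 0.2, 0.0], "car": [0.0, 0.1, 0.9]}
concept_bank = {"animal": [1.0, 0.1, 0.0], "vehicle": [0.0, 0.0, 1.0],
                "object": [0.5, 0.1, 0.5]}

tree = extract_hierarchy(leaves, concept_bank)
print(tree)
```

With these toy vectors, cat and dog are merged first and receive the parent label animal; the root then joins that cluster with car. In practice a library routine for hierarchical clustering would likely replace the quadratic pair search used here.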

Paper Structure (47 sections, 3 equations, 4 figures, 3 tables)

The prospect.
The problem.
Approach.
Contributions.
Related Work
Extraction of learned ontologies.
Background
Deep neural network representations
DNNs.
Latent representations.
Concept embeddings.
Text-to-image alignment.
Ontologies
Hierarchical clustering
Approach
...and 32 more sections

Figures (4)

Figure 1: Illustration of the approach for ontology extraction from multimodal DNNs: For extraction, (1) obtain leaf nodes (cat, dog, car) as the latent representations of their textual descriptions; (2) cluster these to get parent representations (dotted); (3) assign each parent the closest concept (animal) from a concept bank. For inference, check at each level the similarity against the nodes' latent representations (e.g., first animal vs. car).
Figure 2: Comparison of two superclass hierarchies for given leaf concepts (blue) from CIFAR-10 (Krizhevsky, 2009), extracted from the large ViT-L-14 (left; with optimized prompt; 92% accuracy) and the smaller ResNet-50 (right; 46% accuracy) CLIP backbones with optimal distance metric settings. It shows the positive influence of model quality and prompt optimization (using "a photo of a <class>" instead of "<class>") on the plausibility of the extracted ontology, and how the human-alignedness accuracy serves as an indicator for it.
Figure 3: Two exemplary ontological commitments: class hierarchies of the given leaf classes frog, cat, dog, horse, differentiating by (a) biology (mammal vs. amphibian), (b) image background (a Clever Hans effect!).
Figure 4: Visualization of the latent space representations of CIFAR-10 embeddings in different CLIP model backbones (one color per class), generated using the distance-preserving t-SNE dimensionality reduction method (van der Maaten and Hinton, 2008). The better class separation in the transformer-based backbones (b, c) is consistent with the fidelity and human-alignedness results in the tables on prompt engineering and on text vs. image encoding.
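
The inference procedure mentioned in the Figure 1 caption (comparing a query against node representations level by level) can be sketched as follows. This is an illustrative reading of the caption only: the tree layout, the toy vectors, and the use of cosine similarity are all assumptions, and a real query embedding would come from the model's image or text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(query, root):
    """Descend the hierarchy, picking the most similar child at each level.

    Each node is a tuple (name, embedding, children); leaves have no children.
    Returns the list of node names visited from root to leaf.
    """
    node = root
    path = [node[0]]
    while node[2]:
        node = max(node[2], key=lambda child: cosine(query, child[1]))
        path.append(node[0])
    return path

# Hypothetical hierarchy with toy embeddings.
cat = ("cat", [0.9, 0.1, 0.0], [])
dog = ("dog", [0.8, 0.2, 0.0], [])
car = ("car", [0.0, 0.1, 0.9], [])
animal = ("animal", [0.85, 0.15, 0.0], [cat, dog])
root = ("object", [0.5, 0.1, 0.5], [animal, car])

# Hypothetical query embedding that lies close to the dog leaf.
path = classify([0.82, 0.18, 0.02], root)
print(path)
```

Here the query is first compared against animal and car, and only then against cat and dog, so an early wrong turn cannot be undone at a lower level; that is a property of this greedy sketch, not a claim about the paper's method.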

Theorems & Definitions (2)

Definition 1: Ontology
Remark 1
