Cross-Modal Taxonomic Generalization in (Vision-) Language Models

Tianyang Xu; Marcelo Sandoval-Castaneda; Karen Livescu; Greg Shakhnarovich; Kanishka Misra

TL;DR

Cross-modal generalization in LMs appears to arise from both coherence in the extralinguistic input and knowledge derived from language cues; it persists under counterfactual image-label mappings only when the counterfactual data have high visual similarity within each category.

Abstract

What is the interplay between the semantic representations that language models (LMs) learn from surface form alone and those learned from more grounded evidence? We study this question in a scenario where part of the input comes from a different modality: in our case, a vision-language model (VLM) in which a pretrained LM is aligned with a pretrained image encoder. As a case study, we focus on the task of predicting hypernyms of objects represented in images. We do so in a VLM setup where the image encoder and LM are kept frozen, and only the intermediate mappings are learned. We progressively deprive the VLM of explicit evidence for hypernyms, and test whether knowledge of hypernyms is recoverable from the LM. We find that the LMs we study can recover this knowledge and generalize even in the most extreme version of this experiment (when the model receives no evidence of a hypernym during training). Additional experiments suggest that this cross-modal taxonomic generalization persists under counterfactual image-label mappings only when the counterfactual data have high visual similarity within each category. Taken together, these findings suggest that cross-modal generalization in LMs arises as a result of both coherence in the extralinguistic input and knowledge derived from language cues.
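The progressive removal of hypernym evidence described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the function name `build_training_examples`, the `(image_id, category, label)` tuple format, and the `keep_fraction` parameter are all hypothetical, and the sketch assumes a simple random removal of hypernym examples per image. Setting `keep_fraction=0.0` corresponds to the most extreme condition, in which no hypernym appears in training.

```python
import random

def build_training_examples(images, hypernym_of, keep_fraction, seed=0):
    """Build (image_id, category, label) examples with hypernyms partly removed.

    Hypothetical illustration only, not the paper's data format.
    images: list of (image_id, leaf) pairs
    hypernym_of: dict mapping each leaf category to its hypernym
    keep_fraction: probability that an image keeps its hypernym example
    """
    rng = random.Random(seed)
    leaves = sorted(hypernym_of)
    examples = []
    for image_id, leaf in images:
        # Positive leaf example: the category is present in the image.
        examples.append((image_id, leaf, True))
        # Negative leaf example: a different leaf category that is absent.
        negative = rng.choice([other for other in leaves if other != leaf])
        examples.append((image_id, negative, False))
        # Hypernym example, kept only with probability keep_fraction.
        if rng.random() < keep_fraction:
            examples.append((image_id, hypernym_of[leaf], True))
    return examples

# Toy data (assumed for illustration).
hypernym_of = {"koala": "animal", "crow": "bird", "parrot": "bird"}
images = [("img0", "koala"), ("img1", "crow"), ("img2", "parrot")]

no_evidence = build_training_examples(images, hypernym_of, keep_fraction=0.0)
full_evidence = build_training_examples(images, hypernym_of, keep_fraction=1.0)
```

In this simplified version only positive hypernym examples are generated; negative hypernym examples are omitted for brevity.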

Paper Structure (32 sections, 11 figures, 5 tables, 2 algorithms)

Introduction
Related Work
Data, Models, and Methods
Stimuli and Measures
Modeling Choices
Experiment Design
Random Hypernym Ablation
Systematic Hypernym Ablation
Experiments
Preconditions to Generalization
Hypernymy knowledge in LMs
Impact of text supervision in image encoders
Do our models learn the task?
Cross-Modal Taxonomic Generalization
On the Arbitrariness of Cross-Modal Taxonomic Generalization
...and 17 more sections

Figures (11)

Figure 1: An instance of our experiments. During training, the projector is deprived, to varying degrees, of explicit supervision on high-level categories (hypernyms, e.g., animal), and is trained to detect the presence (and absence) of lower-level categories (e.g., koala), keeping the image encoder and the LM backbone frozen. After training, the VLM is tested for generalization to hypernym categories, given previously unseen images.
Figure 2: Depiction of our two ablations, for a sampled toy training set containing several images each of crows, cardinals, and parrots.
Figure 3: Macro F1s of VLMs for predicting the hypernym category for unseen images across LM backbones and image encoder types (DINOv2 vs. SigLIP), in the experiment where all hypernyms are removed. Dashed line indicates majority-label baseline.
Figure 4: Macro F1 on unseen images across LM backbones, LM representations (pre-trained vs. random), and hypernym ablation type (Random vs. Systematic) at different amounts of exposure to hypernym categories, for various test splits. Dashed line indicates the majority-label baseline. Error bands represent 95% confidence intervals across random seeds ($N$ = 3) and categories (hypernyms = 53, leaves = 1216).
Figure 5: Examples of image-leaf mappings resulting from our counterfactual shuffles, in comparison with the original configuration (top). VC indicates the visual coherence of the category under the data configuration.
...and 6 more figures
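
The captions for Figures 3 and 4 report macro F1 against a majority-label baseline. As a reminder of what these quantities are, the following minimal sketch computes macro F1 as the unweighted mean of per-class F1 scores, and a baseline that always predicts the most frequent label. The function names and toy labels are assumptions for illustration; the page does not specify how the paper computes its baseline.

```python
from collections import Counter

def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed classes."""
    classes = set(y_true) | set(y_pred)
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

def majority_baseline(y_true):
    """Macro F1 of a predictor that always outputs the most frequent label."""
    majority = Counter(y_true).most_common(1)[0][0]
    return macro_f1(y_true, [majority] * len(y_true))

# Toy yes/no labels (assumed): three positives, one negative.
y_true = [True, True, True, False]
baseline = majority_baseline(y_true)
```

Because macro F1 weights every class equally, a majority-label predictor scores zero on the minority class, which keeps the baseline well below its accuracy on imbalanced data.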
