Cross-Modal Taxonomic Generalization in (Vision-) Language Models

Tianyang Xu, Marcelo Sandoval-Castaneda, Karen Livescu, Greg Shakhnarovich, Kanishka Misra

TL;DR

Cross-modal generalization in LMs appears to arise from both coherence in the extralinguistic input and knowledge derived from language cues. It persists under counterfactual image-label mappings only when the counterfactual categories have high visual similarity.

Abstract

What is the interplay between semantic representations learned by language models (LMs) from surface form alone and those learned from more grounded evidence? We study this question for a scenario where part of the input comes from a different modality -- in our case, in a vision-language model (VLM), where a pretrained LM is aligned with a pretrained image encoder. As a case study, we focus on the task of predicting hypernyms of objects represented in images. We do so in a VLM setup where the image encoder and LM are kept frozen, and only the intermediate mappings are learned. We progressively deprive the VLM of explicit evidence for hypernyms, and test whether knowledge of hypernyms is recoverable from the LM. We find that the LMs we study can recover this knowledge and generalize even in the most extreme version of this experiment (when the model receives no evidence of a hypernym during training). Additional experiments suggest that this cross-modal taxonomic generalization persists under counterfactual image-label mappings only when the counterfactual data have high visual similarity within each category. Taken together, these findings suggest that cross-modal generalization in LMs arises as a result of both coherence in the extralinguistic input and knowledge derived from language cues.
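
The progressive removal of hypernym evidence can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than the paper's procedure: the taxonomy, the function name `build_training_labels`, and the `keep_fraction` parameter are assumptions, and the paper's own ablations may be defined differently. Each image keeps its leaf label and keeps its hypernym label only with some probability, so `keep_fraction=0.0` corresponds to the extreme setting with no hypernym evidence during training.

```python
import random

# Hypothetical toy taxonomy (leaf -> hypernym); not the paper's data.
TAXONOMY = {
    "koala": "animal",
    "crow": "bird",
    "cardinal": "bird",
    "parrot": "bird",
}


def build_training_labels(leaves, keep_fraction, seed=0):
    """Return, for each image, the category labels visible during training.

    Every image keeps its leaf label. Its hypernym label is kept with
    probability `keep_fraction` (0.0 means no hypernym evidence at all).
    """
    rng = random.Random(seed)
    examples = []
    for leaf in leaves:
        labels = [leaf]
        if rng.random() < keep_fraction:
            labels.append(TAXONOMY[leaf])
        examples.append(labels)
    return examples


leaves = ["koala", "crow", "cardinal", "parrot"]
no_hypernyms = build_training_labels(leaves, keep_fraction=0.0)
all_hypernyms = build_training_labels(leaves, keep_fraction=1.0)
print(no_hypernyms)
print(all_hypernyms)
```

With `keep_fraction=0.0`, any hypernym prediction made at test time would have to come from what the frozen LM already encodes, which is the question the abstract poses.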

Paper Structure (32 sections, 11 figures, 5 tables, 2 algorithms)

Figures (11)

  • Figure 1: An instance of our experiments. During training, the projector is deprived of explicit supervision on high-level categories (hypernyms, e.g., animal) to varying degrees, and is trained to detect the presence (and absence) of lower-level categories (e.g., koala), keeping the image encoder and the LM backbone frozen. After training, the VLM is tested for generalization to hypernym categories, given previously unseen images.
  • Figure 2: Depiction of our two ablations, for a sampled toy training set containing several images each of crows, cardinals, and parrots.
  • Figure 3: Macro F1s of VLMs for predicting the hypernym category for unseen images across LM backbones and image encoder types (DINOv2 vs. SigLIP), in the experiment where all hypernyms are removed. Dashed line indicates majority-label baseline.
  • Figure 4: Macro F1 on unseen images across LM backbones, LM representations (pre-trained vs. random), and hypernym ablation type (Random vs. Systematic) at different amounts of exposure to hypernym categories, for various test splits. Dashed line indicates the majority-label baseline. Error bands represent 95% confidence intervals across random seeds ($N$ = 3) and categories (hypernyms = 53, leaves = 1216).
  • Figure 5: Examples of image-leaf mappings resulting from our counterfactual shuffles, in comparison with the original configuration (top). VC indicates the visual coherence of the category under the data configuration.
  • ...and 6 more figures
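
The captions for Figures 3 and 4 report macro F1 against a majority-label baseline. The sketch below shows the standard definitions of both on toy data: macro F1 is the unweighted mean of per-class F1, and the baseline always predicts the most frequent label. The function names and the yes/no labels are illustrative assumptions, and how the paper aggregates scores across categories and seeds is not specified here.

```python
from collections import Counter


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def majority_baseline(y_true):
    """Predict the most frequent gold label for every example."""
    majority = Counter(y_true).most_common(1)[0][0]
    return [majority] * len(y_true)


# Toy presence/absence answers (hypothetical, not the paper's results).
y_true = ["yes", "yes", "yes", "no"]
y_pred = ["yes", "yes", "no", "no"]

model_score = macro_f1(y_true, y_pred)
baseline_score = macro_f1(y_true, majority_baseline(y_true))
print(round(model_score, 3), round(baseline_score, 3))
```

Because macro F1 weights every class equally, a majority-label predictor scores zero on the minority class, which is why it makes a meaningful floor when labels are imbalanced.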