Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models

Zhuowan Li, Cihang Xie, Benjamin Van Durme, Alan Yuille

TL;DR

This study systematically compares visual representations from vision-and-language (VL) and vision-only pretrained models, using a probing framework across five tasks that span semantics and localization. With the encoders frozen and only lightweight heads trained, the authors find that VL models do better on label-prediction tasks such as object and attribute prediction, while vision-only models do better on dense, spatial tasks such as detection and segmentation. This points to a trade-off between semantic richness and localization fidelity. The work offers practical guidance on choosing a pretrained model for a downstream task and empirical insight into how language influences visual representation learning. Overall, the findings suggest that language-oriented multimodal pretraining strengthens semantic encoding but may dilute localization cues, which can inform the design of future multimodal visual systems.

Abstract

Despite the impressive advancements achieved through vision-and-language pretraining, it remains unclear whether this joint learning paradigm can help understand each individual modality. In this work, we conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models by probing a broad range of tasks, aiming to assess the quality of the learned representations in a nuanced manner. Interestingly, our empirical observations suggest that vision-and-language models are better at label prediction tasks like object and attribute prediction, while vision-only models are stronger at dense prediction tasks that require more localized information. We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models. Code will be released at https://github.com/Lizw14/visual_probing
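
The probing idea summarized above (keep a pretrained encoder frozen, train only a lightweight head on its features) can be illustrated with a small, dependency-free sketch. This is not the authors' code or setup: the `frozen_encoder` function, the toy 2-D data, the logistic-regression head, and all hyperparameters are assumptions made up for illustration. The only point it shows is that the encoder weights stay fixed while the head is fit.

```python
import math
import random


def frozen_encoder(x, weights):
    """Hypothetical stand-in for a pretrained encoder: a fixed projection
    followed by tanh. The weights are never updated ("frozen")."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, epochs=100, lr=0.1):
    """Fit a logistic-regression head on precomputed frozen features with SGD."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            g = sigmoid(z) - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples where the head's predicted class matches the label."""
    correct = 0
    for f, y in zip(features, labels):
        z = sum(wi * fi for wi, fi in zip(w, f)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


rng = random.Random(0)

# Made-up toy data: 2-D points labeled by the sign of the first coordinate.
inputs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
labels = [1 if x[0] > 0 else 0 for x in inputs]

# Encoder weights (4 hidden units) are drawn once and never trained.
encoder_weights = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(4)]
snapshot = [row[:] for row in encoder_weights]

# Features are computed once from the frozen encoder; only the head is trained.
features = [frozen_encoder(x, encoder_weights) for x in inputs]
w, b = train_linear_probe(features[:150], labels[:150])
accuracy = probe_accuracy(w, b, features[150:], labels[150:])
print(f"held-out probe accuracy: {accuracy:.2f}")
```

Because the head is so small, the held-out accuracy mostly reflects what the frozen features already encode, which is why probing is used to compare representations rather than full fine-tuning.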

Paper Structure (29 sections, 3 figures, 8 tables)

Figures (3)

  • Figure 1: We compare the visual representations from unimodal and multimodal models on five tasks, in order to probe the semantics and localization knowledge encoded in the representations.
  • Figure 2: Compared to vision-and-language models, vision-only models more accurately predict the boundary of segmentation masks, but make mistakes in labeling the regions.
  • Figure 3: A closer look at the attribute prediction results by separately evaluating different types of attributes. The advantage of VL models is more significant in the more abstract categories (e.g., action) than visually grounded categories (e.g., texture).