Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models
Zhuowan Li, Cihang Xie, Benjamin Van Durme, Alan Yuille
TL;DR
This study systematically compares the visual representations of vision-and-language (VL) and vision-only pretrained models using a probing framework that spans five tasks covering both semantics and localization. With the encoders frozen and only lightweight heads trained, VL models do better on label prediction tasks such as object and attribute prediction, while vision-only models do better on dense prediction tasks such as detection and segmentation, which require localized information. The results suggest a trade-off between semantic richness and localization fidelity: language-oriented multimodal pretraining strengthens semantic encoding but may dilute localization cues. The comparison also serves as a practical guide for choosing a pretrained model for a downstream task.
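To make the probing idea concrete, the following is a minimal sketch of a frozen encoder with a lightweight trainable head. It is not the authors' code: the fixed random projection stands in for a pretrained encoder, the binary logistic-regression head stands in for the probing heads, and every name (`make_frozen_encoder`, `train_linear_probe`, `probe_accuracy`) is hypothetical. Only the head's parameters are updated, so the resulting accuracy reflects what the fixed features already encode.

```python
import math
import random


def make_frozen_encoder(in_dim, feat_dim, seed=0):
    """Return (encode, weights): a fixed random projection standing in
    for a pretrained encoder (an assumption made for illustration)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
               for _ in range(feat_dim)]

    def encode(x):
        # The weights are only read here, never updated: the encoder is frozen.
        return [math.tanh(sum(w * v for w, v in zip(row, x)))
                for row in weights]

    return encode, weights


def train_linear_probe(features, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on fixed features
    using full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - y  # sigmoid(z) - label
            for i in range(dim):
                grad_w[i] += err * f[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose predicted label matches the true label."""
    correct = 0
    for f, y in zip(features, labels):
        z = sum(wi * fi for wi, fi in zip(w, f)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(labels)


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy data: the label depends only on the sign of the first coordinate.
    inputs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
    labels = [1 if x[0] > 0 else 0 for x in inputs]
    encode, _ = make_frozen_encoder(in_dim=2, feat_dim=8)
    feats = [encode(x) for x in inputs]  # features are computed once and kept fixed
    w, b = train_linear_probe(feats, labels)
    print("probe accuracy:", probe_accuracy(w, b, feats, labels))
```

In a real comparison, each pretrained model would replace the toy encoder and the same head would be trained on each model's features, so that differences in probe accuracy can be attributed to the representations rather than to the head.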
Abstract
Despite the impressive advancements achieved through vision-and-language pretraining, it remains unclear whether this joint learning paradigm can help understand each individual modality. In this work, we conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models by probing a broad range of tasks, aiming to assess the quality of the learned representations in a nuanced manner. Interestingly, our empirical observations suggest that vision-and-language models are better at label prediction tasks like object and attribute prediction, while vision-only models are stronger at dense prediction tasks that require more localized information. We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models. Code will be released at https://github.com/Lizw14/visual_probing
