
Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space

Gaurav Verma, Minje Choi, Kartik Sharma, Jamelle Watson-Daniels, Sejoon Oh, Srijan Kumar

TL;DR

This work investigates whether domain-specific visual attributes in multimodal LLMs are learned by the cross-modal projection or by the LLM itself. The authors compare two fine-tuning strategies, FT-Proj (only the projection is updated) and FT-E2E (the projection and the LLM are updated end to end), across four domain-specific datasets, and they assess the domain-specific richness of the post-projection image representations with an independent MLP. They find that domain-specific attributes are predominantly encoded by the LLM parameters, whether the LLM is frozen or fine-tuned. End-to-end fine-tuning yields larger gains, which the summary attributes to the LLM's larger capacity; this suggests the projection's role is more about leveraging existing LLM knowledge than mapping new attributes into the LLM space. These findings motivate a reinterpretation of cross-modal projection in MLLMs and offer guidance for the design and interpretability of domain-adapted multimodal systems, including where fine-tuning effort should be focused for domain-specific tasks.

Abstract

Multimodal large language models (MLLMs) like LLaVA and GPT-4(V) enable general-purpose conversations about images with the language modality. As off-the-shelf MLLMs may have limited capabilities on images from domains like dermatology and agriculture, they must be fine-tuned to unlock domain-specific applications. The prevalent architecture of current open-source MLLMs comprises two major modules: an image-language (cross-modal) projection network and a large language model. It is desirable to understand the roles of these two modules in modeling domain-specific visual attributes to inform the design of future models and streamline the interpretability efforts on the current models. To this end, via experiments on 4 datasets and under 2 fine-tuning settings, we find that as the MLLM is fine-tuned, it indeed gains domain-specific visual capabilities, but the updates do not lead to the projection extracting relevant domain-specific visual attributes. Our results indicate that the domain-specific visual attributes are modeled by the LLM, even when only the projection is fine-tuned. Through this study, we offer a potential reinterpretation of the role of cross-modal projections in MLLM architectures. Project webpage: https://claws-lab.github.io/projection-in-MLLMs/
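The two fine-tuning settings mentioned above differ only in which modules receive updates. The following is a minimal sketch of that distinction, not the authors' implementation: the module names (`vision_encoder`, `projection`, `llm`), the `Module` class, and the `configure` helper are hypothetical, and the sketch assumes the image encoder stays frozen in both settings, as is common in LLaVA-style models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    name: str         # hypothetical module name, for illustration only
    trainable: bool   # whether this module's parameters are updated


# Assumed three-part layout: image encoder -> projection -> LLM.
MODULES = ("vision_encoder", "projection", "llm")

# Which modules are updated under each strategy (assumption: the image
# encoder is frozen in both settings).
TUNABLE = {
    "FT-Proj": {"projection"},         # only the projection is updated
    "FT-E2E": {"projection", "llm"},   # projection and LLM are updated
}


def configure(strategy: str) -> list[Module]:
    """Return the modules with their trainable flag set for a strategy."""
    if strategy not in TUNABLE:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return [Module(name, name in TUNABLE[strategy]) for name in MODULES]


def trainable_names(modules: list[Module]) -> list[str]:
    """List the names of modules that would receive gradient updates."""
    return [m.name for m in modules if m.trainable]


if __name__ == "__main__":
    for strategy in ("FT-Proj", "FT-E2E"):
        print(strategy, "->", trainable_names(configure(strategy)))
```

In a real training framework the same idea would typically be expressed by disabling gradients on the frozen modules; the plain-Python version here only makes the difference between the two settings explicit.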

Paper Structure (11 sections, 3 figures, 3 tables)


Figures (3)

  • Figure 1: Overview of our study. While the MLLM's domain-specific visual capability can be improved using fine-tuning strategies, the domain-specific richness of the image's post-projection representation does not improve. Results indicate that domain-specific visual attributes are predominantly modeled by the LLM parameters (whether frozen or not) and the projection does not necessarily play a role in mapping visual attributes to the LLM space.
  • Figure 2: Architecture of the MLLM considered in this study. $\phi$ and $\theta$ denote tunable parameters of the projection and the large language model, respectively.
  • Figure 3: Illustration of the $4$ domain-specific image classification datasets used in this study. The datasets are from diverse domains; for brevity we only show some of the representative labels from each of the datasets. Images best viewed with zoom.
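One hedged way to write down the setup described in the Figure 1 and Figure 2 captions, using the captions' $\phi$ and $\theta$. All other symbols are notation assumed here for illustration, not taken from the paper: $g$ is the image encoder, $f_\phi$ the projection, $x_{\text{img}}$ and $x_{\text{txt}}$ the image and text inputs, $y$ the target label, $\mathcal{L}$ a training loss, and $h_\psi$ an independent MLP probe with its own parameters $\psi$.

$$
\hat{y} = \mathrm{LLM}_\theta\big(f_\phi(g(x_{\text{img}})),\; x_{\text{txt}}\big)
$$

$$
\text{FT-Proj:}\;\; \min_{\phi}\; \mathcal{L}(\hat{y}, y) \;\;\text{with } \theta \text{ fixed},
\qquad
\text{FT-E2E:}\;\; \min_{\phi,\,\theta}\; \mathcal{L}(\hat{y}, y)
$$

$$
\text{Post-projection probe:}\;\; \min_{\psi}\; \mathcal{L}\big(h_\psi(f_\phi(g(x_{\text{img}}))),\; y\big) \;\;\text{with } \phi \text{ fixed}
$$

Under this reading, the probe's accuracy measures how much domain-specific information is present in the post-projection representation alone, separately from what the LLM contributes.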