Multimodal LLMs Do Not Compose Skills Optimally Across Modalities

Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune

TL;DR

Multimodal Large Language Models struggle to optimally compose their visual and textual skills across modalities, even for straightforward sequential tasks. The authors introduce three image-to-text tasks that require combining a visual skill with a textual one, and compare direct inference (standard prompting) with cascaded inference that enforces skill composition. They test open-source MLLMs and explore two mitigation strategies (composition-specific chain-of-thought prompting and cross-modal fine-tuning), finding improvements but not a full closure of the cross-modal skill composition gap. The results highlight a systematic, cross-model gap in cross-modal skill integration and motivate further research into underlying mechanisms and robust training approaches. Overall, the work informs future directions in cross-modal alignment, representation learning, and skill composition for multimodal AI systems.

Abstract

Skill composition is the ability to combine previously learned skills to solve new tasks. As neural networks acquire increasingly complex skills during their pretraining, it is not clear how successfully they can compose them. In this paper, we focus on Multimodal Large Language Models (MLLMs), and study their ability to compose skills across modalities. To this end, we design three evaluation tasks which can be solved by sequentially composing two modality-dependent skills, and evaluate several open MLLMs under two main settings: i) prompting the model to directly solve the task, and ii) using a two-step cascaded inference approach, which manually enforces the composition of the two skills for a given task. Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap. To mitigate this gap, we explore two alternatives: i) chain-of-thought prompting that explicitly instructs MLLMs to compose skills, and ii) a specific fine-tuning recipe to promote skill composition. Although these strategies improve model performance, the models still exhibit significant skill composition gaps, suggesting that more research is needed to improve cross-modal skill composition in MLLMs.
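The two evaluation settings described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' code: the `Model` interface, the function names, the prompts, and the `toy_model` stub (which stands in for a real MLLM and treats a list of integers as the "image"). The gap is computed here as cascaded accuracy minus direct accuracy, which is one plausible reading of the "skill composition gap".

```python
from typing import Callable, List, Optional

# Hypothetical interface: a model maps (prompt, optional image) to a text answer.
Model = Callable[[str, Optional[List[int]]], str]


def direct_inference(model: Model, image: List[int], task_prompt: str) -> str:
    """Setting (i): a single call that asks the model to solve the whole task."""
    return model(task_prompt, image)


def cascaded_inference(
    model: Model,
    image: List[int],
    visual_prompt: str,
    textual_prompt_template: str,
) -> str:
    """Setting (ii): two calls that manually enforce the skill composition."""
    # Step 1: apply the visual skill to the image (e.g. read its content).
    intermediate = model(visual_prompt, image)
    # Step 2: apply the textual skill to the intermediate text, without the image.
    return model(textual_prompt_template.format(intermediate=intermediate), None)


def composition_gap(direct_accuracy: float, cascaded_accuracy: float) -> float:
    """Assumed definition: how much accuracy direct inference loses vs. cascaded."""
    return cascaded_accuracy - direct_accuracy


def toy_model(prompt: str, image: Optional[List[int]] = None) -> str:
    """Illustrative stub only: it can read numbers or sum them, but not both at once."""
    if image is not None:
        if prompt.startswith("Read"):
            return " ".join(str(n) for n in image)
        return "unknown"
    numbers = [int(token) for token in prompt.split(":")[-1].split()]
    return str(sum(numbers))


if __name__ == "__main__":
    image = [1, 2, 3]
    print(direct_inference(toy_model, image, "Sum the numbers in the image."))
    print(
        cascaded_inference(
            toy_model,
            image,
            "Read the numbers in the image.",
            "Sum these numbers: {intermediate}",
        )
    )
```

The stub is deliberately built so that the cascaded path succeeds where the direct path fails; with a real MLLM, both paths would be scored over a dataset and the resulting accuracies compared.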

Paper Structure

This paper contains 37 sections, 6 figures, 19 tables.

Figures (6)

  • Figure 1: We define three tasks that can be solved with a trivial sequential composition of two modality-specific skills. We evaluate MLLMs on those tasks using direct inference, i.e. standard inference. We compare the performance with the cascaded inference, where we manually induce the required skill composition with two calls to the MLLM.
  • Figure 2: Prompt used in the oracle setup for Task 2
  • Figure 3: Example from the Sort dataset.
  • Figure 4: Example from the Sum dataset.
  • Figure 5: Example generation from LLaVA 1.6-Mistral-7B using the direct inference setup and the CoT strategy on the GSM8K dataset.
  • ...and 1 more figure