Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

Masayuki Kawarada, Tatsuya Ishigaki, Hiroya Takamura

Abstract

Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision and systematically varying the history-target mismatch along three axes: modality, reasoning, and answer format. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only histories to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions. It is driven most strongly by modality differences, followed by answer format, while shifts in reasoning requirements cause minimal degradation.

Paper Structure (20 sections, 2 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: An illustrative example of multimodal task interference. After processing a conversational history of image captioning tasks, the model is prompted with a text-only question. The sudden task switch causes the model to erroneously expect a visual input, leading to a failure in answering a simple factual question.
  • Figure 2: A heatmap visualizing the performance drop (relative change in %) across all pairwise combinations of history and target datasets for GPT-4.1-mini with a history length of $N=3$.
  • Figure 3: Performance difference ($\Delta$) between mismatch and match conditions across varying history lengths ($N=1, 3, 5$) for modality, reasoning, and answer format dimensions.
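
The two quantities named in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. The exact definitions used in the paper are not given in this summary, so the formulas below are assumptions: relative change is taken as the percentage difference from a no-history baseline score, and $\Delta$ as the mismatch score minus the match score. The function names and the example scores are hypothetical.

```python
def relative_change_pct(baseline: float, with_history: float) -> float:
    """Relative change (in %) of a score against a baseline.

    Assumed definition: (with_history - baseline) / baseline * 100.
    Negative values indicate a performance drop.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (with_history - baseline) / baseline * 100.0


def mismatch_delta(match_score: float, mismatch_score: float) -> float:
    """Assumed definition of Delta: mismatch score minus match score."""
    return mismatch_score - match_score


# Hypothetical scores, for illustration only (not results from the paper).
drop = relative_change_pct(baseline=80.0, with_history=60.0)
delta = mismatch_delta(match_score=70.0, mismatch_score=55.0)
print(f"relative change: {drop:.1f}%")
print(f"delta: {delta:.1f}")
```

Under these assumed definitions, a negative relative change or a negative $\Delta$ corresponds to interference from the conversational history.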