Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

Masayuki Kawarada; Tatsuya Ishigaki; Hiroya Takamura

Abstract

Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of the history-target mismatch along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions. It is driven most strongly by modality differences, followed by answer format differences, while shifts in reasoning requirements cause minimal degradation.

Paper Structure (20 sections, 2 equations, 3 figures, 3 tables)

Introduction
Related Work
Multimodal Large Language Models
Task Interference
Task Interference
Experiments
Target Tasks and Datasets
Multimodal Large Language Models
Experimental Setup
Evaluation Metrics
Results and Discussion
Effects of Task Interference along Three Axes
Modality Mismatch
Reasoning Mismatch
Answer Format Mismatch
...and 5 more sections

Figures (3)

Figure 1: An illustrative example of multimodal task interference. After processing a conversational history of image captioning tasks, the model is prompted with a text-only question. The sudden task switch causes the model to erroneously expect a visual input, leading to a failure in answering a simple factual question.
Figure 2: A heatmap visualizing the performance drop (relative change in %) across all pairwise combinations of history and target datasets for GPT-4.1-mini with a history length of $N=3$.
Figure 3: Performance difference ($\Delta$) between mismatch and match conditions across varying history lengths ($N=1, 3, 5$) for modality, reasoning, and answer format dimensions.
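
The two quantities named in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. The exact definitions used in the paper are not shown in this excerpt, so the formulas below are assumptions: the relative change is taken against a baseline score (here assumed to be the score without a mismatched history), and $\Delta$ is taken as the mismatch score minus the match score. The function names and the numbers are hypothetical.

```python
def relative_change_pct(baseline: float, interfered: float) -> float:
    """Relative change in percent of a score against a baseline score.

    Assumption: negative values indicate a performance drop.
    """
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (interfered - baseline) / baseline * 100.0


def mismatch_delta(mismatch_score: float, match_score: float) -> float:
    """Difference between mismatch and match conditions (assumed sign convention)."""
    return mismatch_score - match_score


# Hypothetical scores, for illustration only (not results from the paper).
print(relative_change_pct(baseline=80.0, interfered=60.0))  # -25.0
print(mismatch_delta(mismatch_score=0.5, match_score=0.75))  # -0.25
```

Under this sign convention, a more negative value in either quantity would correspond to stronger interference.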