MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks

Sara Papi, Maike Züfle, Marco Gaido, Beatrice Savoldi, Danni Liu, Ioannis Douros, Luisa Bentivogli, Jan Niehues

TL;DR

MCIF introduces the first manually annotated benchmark for crosslingual multimodal instruction-following, spanning text, speech, and video across English, German, Italian, and Chinese. Built from ACL 2023 talks, it includes short- and long-form contexts across 13 tasks grouped into recognition, translation, QA, and summarization, with two prompt variants (MCIFfix, MCIFmix). The authors benchmark 23 models from LLMs, SpeechLLMs, VideoLLMs, and MLLMs using standard metrics, revealing strong performance in translation and notable weaknesses in long-form processing and multimodal integration. Overall, MCIF exposes significant gaps between current models and robust crosslingual multimodal instruction-following, providing a rich resource and baseline for advancing the field.

Abstract

Recent advances in large language models have catalyzed the development of multimodal LLMs (MLLMs) that integrate text, speech, and vision within unified frameworks. As MLLMs evolve from narrow, monolingual, task-specific systems to general-purpose instruction-following models, a key frontier lies in evaluating their multilingual and multimodal capabilities over both long and short contexts. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form contexts, or lack human annotations -- hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first multilingual human-annotated benchmark based on scientific talks that is designed to evaluate instruction-following in crosslingual, multimodal settings over both short- and long-form inputs. MCIF spans three core modalities -- speech, vision, and text -- and four diverse languages (English, German, Italian, and Chinese), enabling a comprehensive evaluation of MLLMs' abilities to interpret instructions across languages and combine them with multimodal contextual information. MCIF is released under a CC-BY 4.0 license to encourage open research and progress in MLLM development.

Paper Structure

This paper contains 22 sections, 4 figures, 10 tables.

Figures (4)

  • Figure 1: Breakdown of MCIF statistics. Total length is measured in space-separated words for English, German, and Italian, and in characters for Chinese. Question-answer statistics in the inner circle refer to the question type, while the outer circle refers to the input modality (see the annotations subsection).
  • Figure 2: MLLM results on MCIFmix by inference modality, averaged across languages.
  • Figure 3: Performance breakdown on MCIFmixLONG QA of the best models by question modality and source.
  • Figure 4: