UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

Yanlin Li, Minghui Guo, Kaiwen Zhang, Shize Zhang, Yiran Zhao, Haodong Li, Congyue Zhou, Weijie Zheng, Yushen Yan, Shengqiong Wu, Wei Ji, Lei Cui, Furu Wei, Hao Fei, Mong-Li Lee, Wynne Hsu

TL;DR

This paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset, and proposes UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation.

Abstract

In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.

Paper Structure

This paper contains 70 sections, 32 equations, 111 figures, and 30 tables.

Figures (111)

  • Figure 1: Illustration of the any-to-any interleaved multimodal paradigm with different real-world application scenarios. Solving any-to-any interleaved multimodal learning requires complex and combined capabilities.
  • Figure 2: Distribution of different difficulty levels.
  • Figure 3: Illustration of the UniM evaluation suite. ① refers to the calculation process of the StS and LeS (see the Response Structure Integrity section). ② represents the calculation process of the ICS in Eq. (2). ③ refers to the calculation process of the SQCS; see Eq. (1).
  • Figure 4: Overview of the UniMA architecture.
  • Figure 5: Results for rationality verification of SQCS and ICS.
  • ...and 106 more figures