Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark

Hanlei Zhang, Zhuohang Li, Yeshuang Zhu, Hua Xu, Peiwu Wang, Haige Zhu, Jie Zhou, Jinchao Zhang

TL;DR

MMLA introduces the first comprehensive benchmark for multimodal language analysis focused on high-level cognitive semantics across six dimensions. It aggregates over 61K multimodal utterances from nine datasets and evaluates both LLMs and MLLMs under zero-shot inference, supervised fine-tuning (SFT), and instruction tuning (IT), with LoRA used to adapt the models in both tuning settings. Current models struggle in the zero-shot setting but benefit substantially from SFT and IT, and small MLLMs can compete with larger ones when properly trained. IT yields unified models that perform across tasks, yet even the best models remain below 70% accuracy on average, which underscores the difficulty of the problem and the need for architectural advances and better data. Overall, MMLA provides a solid foundation and open resources to drive progress in multimodal language analysis and cross-modal cognition.

Abstract

Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Overview of the MMLA benchmark. The left side shows examples from six evaluation dimensions and nine datasets. The right side displays three methods for evaluating both LLMs and MLLMs: (1) zero-shot inference (top right), which generates predictions from task-specific prompts; (2) supervised fine-tuning (middle right), which trains on each supervised task; and (3) instruction tuning (bottom right), which trains on multiple tasks simultaneously. Both (2) and (3) utilize LoRA to efficiently adapt foundation models.
  • Figure 2: Rank of foundation models after zero-shot inference.
  • Figure 3: Rank of foundation models after SFT and IT.
  • Figure 4: Fine‑grained zero‑shot inference and SFT performance (ACC). Within each bar, the light-colored lower segment corresponds to zero-shot inference performance, while the darker upper segment represents the additional gain from SFT. The performance of SOTA MML methods (if available) and of GPT‑4o is indicated with purple and green dashed lines, respectively.
  • Figure 5: Fine‑grained performance (ACC) of instruction‑tuned MLLMs and LLMs on each dataset across six dimensions. The performance of SOTA MML methods and of humans is indicated with dashed lines, if available.
  • ...and 1 more figures
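
The Figure 1 caption distinguishes supervised fine-tuning (training on each task separately) from instruction tuning (training on multiple tasks simultaneously). A minimal sketch of how the two training sets might differ is given below. It is an illustration only, not the benchmark's code: the prompt wording, the toy data, and the helper names `build_prompt`, `sft_sets`, `it_set`, and `accuracy` are all assumptions, and the LoRA adaptation step itself is omitted.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical toy format: task name -> list of (utterance text, label) pairs.
Example = Tuple[str, str]


def build_prompt(task: str, text: str, labels: List[str]) -> str:
    """Format a task-specific classification prompt (wording is illustrative)."""
    options = ", ".join(labels)
    return f"Task: {task}. Utterance: {text!r}. Choose one label from: {options}."


def sft_sets(data: Dict[str, List[Example]]) -> Dict[str, List[Example]]:
    """SFT view: one (prompt, label) training set per task."""
    out: Dict[str, List[Example]] = {}
    for task, examples in data.items():
        labels = sorted({label for _, label in examples})
        out[task] = [(build_prompt(task, text, labels), label)
                     for text, label in examples]
    return out


def it_set(data: Dict[str, List[Example]], seed: int = 0) -> List[Example]:
    """IT view: pool every task's examples into one shuffled training set."""
    pooled = [pair for pairs in sft_sets(data).values() for pair in pairs]
    random.Random(seed).shuffle(pooled)
    return pooled


def accuracy(preds: List[str], golds: List[str]) -> float:
    """Fraction of predictions that exactly match the gold labels (ACC)."""
    if not golds:
        return 0.0
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


if __name__ == "__main__":
    toy = {
        "sentiment": [("I love this.", "positive"), ("This is awful.", "negative")],
        "intent": [("Could you help me?", "ask for help")],
    }
    print({task: len(pairs) for task, pairs in sft_sets(toy).items()})
    print(len(it_set(toy)))
    print(accuracy(["positive", "negative"], ["positive", "positive"]))
```

Under this framing, SFT would adapt one model per entry of `sft_sets`, while IT would adapt a single model on the pooled output of `it_set`; accuracy is then computed per dataset in either case.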