MA-Bench: Towards Fine-grained Micro-Action Understanding

Kun Li, Jihao Gu, Fei Wang, Zhiliang Wu, Hehe Fan, Dan Guo

Abstract

With the rapid development of Multimodal Large Language Models (MLLMs), their potential for Micro-Action understanding, which plays a vital role in human emotion analysis, remains unexplored due to the absence of specialized benchmarks. To tackle this issue, we present MA-Bench, a benchmark comprising 1,000 videos and a three-tier evaluation architecture that progressively examines micro-action perception, relational comprehension, and interpretive reasoning. MA-Bench contains 12,000 structured question-answer pairs, enabling systematic assessment of both recognition accuracy and action interpretation. Evaluation results of 23 representative MLLMs reveal significant challenges in capturing motion granularity and fine-grained body-part dynamics. To address these challenges, we further construct MA-Bench-Train, a large-scale training corpus of 20.5K videos annotated with structured micro-action captions for fine-tuning MLLMs. Qwen3-VL-8B fine-tuned on MA-Bench-Train shows clear performance improvements across micro-action reasoning and explanation tasks. Our work aims to establish a foundational benchmark for advancing MLLMs in understanding subtle micro-actions and human-related behaviors. Project Page: https://MA-Bench.github.io

Paper Structure

This paper contains 15 sections, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: The pipeline for constructing MA-Bench and MA-Bench-Train for fine-grained micro-action understanding. (1) Micro-motion tracker extracts motion descriptors (i.e., motion vectors and coordinates) for each body part. (2) Micro-action benchmark generation leverages motion descriptors and multimodal large language models to create structured micro-action captions, which are then used to generate the benchmarks. (3) MA-Bench enables fine-grained action understanding through well-defined perceptual recognition, relational comprehension, and interpretive reasoning.
  • Figure 2: Data statistics of MA-Bench. Action category definitions are consistent with the Micro-Action-52 dataset [guo2024benchmarking].
  • Figure 3: Example of question-answer pairs from the proposed MA-Bench.
  • Figure 4: The pipeline for semi-automatic question-answer generation.
  • Figure 5: Qualitative example on the task of Micro-Action Reasoning and Explanation (MARE). Green font denotes correct predictions, red indicates completely incorrect ones, yellow marks partial errors, black shows results with no impact, and purple highlights faulty reasoning chains. More examples of other tasks are provided in the supplementary material.