A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning

Siyang Jiang, Mu Yuan, Xiang Ji, Bufang Yang, Zeyu Liu, Lilin Xu, Yang Li, Yuting He, Liran Dong, Wenrui Lu, Zhenyu Yan, Xiaofan Jiang, Wei Gao, Hongkai Chen, Guoliang Xing

TL;DR

CUHK-X tackles the need for richly described, multimodal human activity data by introducing a large-scale, ground-truth-first (GT-first) dataset with seven sensor modalities and 40 actions across two indoor environments. A scene-based caption generation framework, aided by LLMs and human checks, yields coherent <data,caption> pairs that support HAR, HAU, and HARn benchmarks. The paper provides extensive baselines across modalities and tasks, revealing modality strengths, long-tail and cross-subject challenges, and the value of fine-tuning for HAR, while highlighting reasoning-based models as superior for HARn. By delivering synchronized multimodal data and rigorous evaluation protocols, CUHK-X offers a foundation for robust multimodal learning, sensor fusion, and LLM-based action understanding in real-world settings.

Abstract

Multimodal human action recognition (HAR) leverages complementary sensors for activity classification. Beyond recognition, recent advances in large language models (LLMs) enable detailed descriptions and causal reasoning, motivating new tasks: human action understanding (HAU) and human action reasoning (HARn). However, most LLMs, especially large vision language models (LVLMs), struggle with non-RGB modalities such as depth, IMU, and mmWave due to the lack of large-scale data-caption resources. Existing HAR datasets mainly provide coarse data-label annotations, which are insufficient to capture fine-grained action dynamics needed for HAU and HARn. We consider two ground-truth pair types: (1) data label (discrete category) and (2) data caption (textual description). Naively generating captions from labels often lacks logical and spatiotemporal consistency. We introduce CUHK-X, a large-scale multimodal dataset and benchmark suite for HAR, HAU, and HARn. CUHK-X contains 58,445 samples covering 40 actions performed by 30 participants across two indoor environments. To improve caption consistency, we propose a prompt-based scene creation method that leverages LLMs to generate logically connected activity sequences, followed by human validation. CUHK-X includes three benchmarks with six evaluation tasks. Experiments report average accuracies of 76.52% (HAR), 40.76% (HAU), and 70.25% (HARn). CUHK-X aims to enable the community to apply and develop data-intensive learning methods for robust, multimodal human activity analysis. Project page and code: https://openaiotlab.github.io/CUHK-X/ and https://github.com/openaiotlab/CUHK-X.

Paper Structure

This paper contains 70 sections, 14 figures, 4 tables.

Figures (14)

  • Figure 1: CUHK-X captures a multi-room home environment and supports three tasks: HAR (classification), HAU (captioning), and HARn (intention prediction). It integrates diverse modalities, including RGB, depth, thermal, infrared, IMU, skeleton, and mmWave.
  • Figure 2: Limitations of SOTA LVLMs in HAU tasks.
  • Figure 3: Frequency of the actions among USC [zhang2012usc], Shoaib [shoaib2014fusion], HHAR [stisen2015smart], UTD [chen2015utd], ActivityNet [yu2019activitynet], UCI [reyes2016transition], NTU [liu2019ntu, shahroudy2016ntu], PKU-MMD [liu2017pku], Cosmo [ouyang2022cosmo], mRI [an2022mri], and Thermal-IM [ThermalIM2023].
  • Figure 4: CUHK-X includes 40 actions in 7 categories.
  • Figure 5: Photos of our ambient and wearable sensor hardware.
  • ...and 9 more figures