Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding

Minghui Wu, Chenxu Zhao, Anyang Su, Donglin Di, Tianyu Fu, Da An, Min He, Ya Gao, Meng Ma, Kun Yan, Ping Wang

TL;DR

A large-scale Video Subjective Multi-modal Evaluation dataset, namely Video-SME, is introduced and a Hypergraph Multi-modal Large Language Model (HMLLM) is designed to explore the associations among different demographics, video elements, EEG and eye-tracking indicators.

Abstract

Understanding of video creativity and content often varies among individuals, with focal points and cognitive levels differing across ages, experiences, and genders. Research in this area is currently lacking, and most existing benchmarks suffer from two drawbacks: 1) a limited number of modalities and answers of restricted length; 2) video content and scenarios that are excessively monotonous, conveying overly simplistic allegories and emotions. To bridge the gap to real-world applications, we introduce a large-scale Subjective Response Indicators for Advertisement Videos dataset, namely SRI-ADV. Specifically, we collected real changes in electroencephalographic (EEG) and eye-tracking regions from different demographics while they viewed identical video content. Using this multi-modal dataset, we developed tasks and protocols to analyze and evaluate the extent to which different users cognitively understand video content. Along with the dataset, we designed a Hypergraph Multi-modal Large Language Model (HMLLM) to explore the associations among different demographics, video elements, EEG, and eye-tracking indicators. HMLLM can bridge semantic gaps across rich modalities and integrate information across modalities to perform logical reasoning. Extensive experimental evaluations on SRI-ADV and additional video-based generative performance benchmarks demonstrate the effectiveness of our method. The code and dataset will be released at https://github.com/mininglamp-MLLM/HMLLM.

Paper Structure (32 sections, 6 equations, 6 figures, 9 tables, 1 algorithm)

Figures (6)

  • Figure 1: Our proposed Subjective Response Indicators for Advertisement Videos (SRI-ADV) dataset. Real-time signals captured by electroencephalographic (EEG) and eye-tracking devices reveal that Audience Profiles (AP) of varying genders and ages exhibit distinct engagements, emotions, and eye motion ratios (EMR) when exposed to various scenes and elements within the same advertisement video.
  • Figure 2: Generation pipeline of the SRI-ADV dataset. The left side of this figure illustrates the process of SRI data collection, computation, and amalgamation. This involves acquiring raw signals from subjects, processing the signals by video scene, and pooling data from subjects with similar demographic profiles to obtain aggregated subjective response indicators and instructions for language models. The middle section depicts video preprocessing with Frame Sequence for Video Representation (FSVR) via scene detection and Automatic Speech Recognition (ASR). On the right side, we present our proposed semi-automated video Q&A generation process, which leverages both video storyboarding from FSVR and dialogue text from ASR. This integration enriches video content comprehension, thereby facilitating both Subjectivity and Objectivity Tasks.
  • Figure 3: Overview of the Hypergraph Multi-modal Large Language Model (HMLLM). The architecture comprises a suite of pre-trained models, including a "Visual Encoder", "Q-Former", and the "SRI-Aware Language Model (SALM)", which are initially frozen and subsequently fine-tuned through strategic training procedures. More importantly, our model incorporates a designed "SRI-Aware Language Hypergraph Learning (SAL-HL)" module that is trained de novo via a combined loss function. During inference, the HMLLM generates SRI and Q&A responses tailored to the video content, thereby providing a deeper level of engagement and comprehension.
  • Figure 4: Qualitative analysis of SRI-ADV. Green signifies accurate descriptions, while red denotes incorrect responses.
  • Figure 5: Equipment for Collecting Subjective Responses of SRI-ADV dataset. During data acquisition, participants wear an EEG Device, facing a Video Display, with an Eye-Tracking Device below to monitor gaze. Video durations and subjective responses are recorded on an Integration Display for analysis.
  • ...and 1 more figures
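
The pooling step described in the Figure 2 caption (processing signals by video scene, then pooling subjects with similar demographic profiles) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the record fields (`gender`, `age_group`, `scene`, `engagement`, `emr`), the sample values, the helper name `pool_by_profile`, and the choice of a plain mean as the aggregation are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-subject, per-scene indicator records.
# Field names and values are made up for illustration; they are not from SRI-ADV.
records = [
    {"gender": "F", "age_group": "18-29", "scene": 0, "engagement": 0.62, "emr": 0.40},
    {"gender": "F", "age_group": "18-29", "scene": 0, "engagement": 0.58, "emr": 0.50},
    {"gender": "M", "age_group": "30-39", "scene": 0, "engagement": 0.30, "emr": 0.20},
]


def pool_by_profile(records, indicators=("engagement", "emr")):
    """Average each indicator over subjects sharing (gender, age_group, scene).

    Assumption: a simple mean stands in for whatever aggregation the paper uses.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["gender"], rec["age_group"], rec["scene"])
        groups[key].append(rec)
    return {
        key: {name: mean(rec[name] for rec in rows) for name in indicators}
        for key, rows in groups.items()
    }


pooled = pool_by_profile(records)
for profile, values in sorted(pooled.items()):
    print(profile, values)
```

Keying on the (profile, scene) pair keeps one aggregated indicator vector per audience profile and scene, which is the granularity the Figure 1 and Figure 2 captions describe.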