M&M: Multimodal-Multitask Model Integrating Audiovisual Cues in Cognitive Load Assessment

Long Nguyen-Phuoc, Renald Gaboriau, Dimitri Delacroix, Laurent Navarro

TL;DR

The paper tackles cognitive load assessment (CLA) by integrating audiovisual cues in a multimodal-multitask framework (M&M) evaluated on the AVCAffe dataset. It introduces two streams (AudioNet and VideoNet) whose features are fused by a crossmodal multihead attention mechanism, followed by three task-specific branches, one per cognitive-load label. Although its performance is modest relative to the AVCAffe single-task baseline, the model offers a promising framework for unified multimodal-multitask processing in CLA. The work lays groundwork for future multimodal-multitask systems aimed at more robust and efficient cognitive-state estimation in real-world settings.

Abstract

This paper introduces the M&M model, a novel multimodal-multitask learning framework, applied to the AVCAffe dataset for cognitive load assessment (CLA). M&M uniquely integrates audiovisual cues through a dual-pathway architecture, featuring specialized streams for audio and video inputs. A key innovation lies in its cross-modality multihead attention mechanism, fusing the different modalities for synchronized multitasking. Another notable feature is the model's three specialized branches, each tailored to a specific cognitive load label, enabling nuanced, task-specific analysis. While it shows modest performance compared to AVCAffe's single-task baseline, M&M demonstrates a promising framework for integrated multimodal processing. This work paves the way for future enhancements in multimodal-multitask learning systems, emphasizing the fusion of diverse data types for complex task handling.
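
The fusion pattern described in the abstract (two modality streams, cross-modality multihead attention, three task branches) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. Everything below is an assumption made for illustration: the feature dimension, the sequence lengths, the choice of audio as queries and video as keys/values, the mean pooling, the binary sigmoid outputs, and all function and parameter names. Random arrays stand in for the outputs of AudioNet and VideoNet.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(d, num_tasks, rng):
    # Hypothetical parameter layout: four (d, d) projections for attention,
    # plus one linear (weight, bias) pair per task branch.
    scale = 1.0 / np.sqrt(d)
    params = {name: rng.normal(0.0, scale, (d, d)) for name in ("Wq", "Wk", "Wv", "Wo")}
    params["branches"] = [(rng.normal(0.0, scale, d), 0.0) for _ in range(num_tasks)]
    return params


def cross_modal_attention(audio, video, params, num_heads):
    # audio: (Ta, d) supplies queries; video: (Tv, d) supplies keys and values.
    # This direction is an assumption; the source does not specify it.
    t_a, d = audio.shape
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    d_head = d // num_heads

    def split(x):
        # (T, d) -> (num_heads, T, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split(audio @ params["Wq"])
    k = split(video @ params["Wk"])
    v = split(video @ params["Wv"])

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, Ta, Tv)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, Ta, d_head)
    merged = heads.transpose(1, 0, 2).reshape(t_a, d)     # concatenate heads
    return merged @ params["Wo"], weights


def forward(audio, video, params, num_heads=4):
    fused, weights = cross_modal_attention(audio, video, params, num_heads)
    pooled = fused.mean(axis=0)  # shared representation, shape (d,)
    # One branch per cognitive-load label; sigmoid outputs assume binary labels.
    probs = [1.0 / (1.0 + np.exp(-(pooled @ w + b))) for w, b in params["branches"]]
    return np.array(probs), weights


rng = np.random.default_rng(0)
d = 32
audio_feats = rng.normal(size=(10, d))  # placeholder for AudioNet output
video_feats = rng.normal(size=(16, d))  # placeholder for VideoNet output
params = init_params(d, num_tasks=3, rng=rng)
probs, attn = forward(audio_feats, video_feats, params)
print(probs.shape, attn.shape)
```

The three branches share the pooled, fused representation and differ only in their final linear layer, which is the usual way a multitask model shares computation across labels. In training, each branch would typically get its own loss term, and the terms would be combined into one objective.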

Paper Structure (22 sections, 6 equations, 2 figures, 2 tables, 1 algorithm)

Figures (2)

  • Figure 1: The M&M model's architecture
  • Figure 2: Examples from the AVCAffe dataset. The self-reported cognitive load scores shown are on the NASA-TLX scale.