Edu-EmotionNet: Cross-Modality Attention Alignment with Temporal Feedback Loops

S M Rafiuddin

TL;DR

Edu-EmotionNet tackles robust, real-time multimodal emotion recognition in online education by jointly modeling emotion dynamics and modality reliability. It introduces Cross-Modality Attention Alignment for contextual cross-modal reasoning, a Modality Importance Estimator for dynamic, confidence-based fusion, and a Temporal Feedback Loop to enforce temporal consistency. On re-annotated educational subsets of IEMOCAP and MOSEI, it achieves state-of-the-art accuracy and macro-F1 while showing resilience to missing or noisy modalities, with interpretable modality weighting that adapts to signal quality. The approach has practical implications for real-time, personalized learning systems and informs future work on incorporating additional signals and efficient deployment.

Abstract

Understanding learner emotions in online education is critical for improving engagement and personalized instruction. While prior work in emotion recognition has explored multimodal fusion and temporal modeling, existing methods often rely on static fusion strategies and assume that modality inputs are consistently reliable, which is rarely the case in real-world learning environments. We introduce Edu-EmotionNet, a novel framework that jointly models temporal emotion evolution and modality reliability for robust affect recognition. Our model incorporates three key components: a Cross-Modality Attention Alignment (CMAA) module for dynamic cross-modal context sharing, a Modality Importance Estimator (MIE) that assigns confidence-based weights to each modality at every time step, and a Temporal Feedback Loop (TFL) that leverages previous predictions to enforce temporal consistency. Evaluated on educational subsets of IEMOCAP and MOSEI, re-annotated for confusion, curiosity, boredom, and frustration, Edu-EmotionNet achieves state-of-the-art performance and demonstrates strong robustness to missing or noisy modalities. Visualizations confirm its ability to capture emotional transitions and adaptively prioritize reliable signals, making it well suited for deployment in real-time learning systems.
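
As a rough illustration of the confidence-based weighting the abstract attributes to the MIE, the sketch below normalizes per-modality confidence scores with a softmax and forms a weighted sum of the modality features at one time step. The abstract does not say how the scores are computed or normalized, so the softmax, the masking of absent modalities, and the names `modality_weights` and `fuse` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def modality_weights(scores, present):
    """Softmax over confidence scores, restricted to present modalities.

    `scores` maps modality name -> raw confidence score (hypothetical input;
    the paper's estimator would produce these). Absent modalities get weight 0
    (an assumption of this sketch).
    """
    names = [m for m in scores if present.get(m, False)]
    if not names:
        raise ValueError("at least one modality must be present")
    peak = max(scores[m] for m in names)  # subtract max for numerical stability
    exps = {m: math.exp(scores[m] - peak) for m in names}
    total = sum(exps.values())
    return {m: (exps[m] / total if m in exps else 0.0) for m in scores}


def fuse(features, weights):
    """Weighted sum of equally sized per-modality feature vectors."""
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for name, vec in features.items():
        for i, x in enumerate(vec):
            fused[i] += weights[name] * x
    return fused


if __name__ == "__main__":
    scores = {"audio": 0.2, "visual": 1.5, "text": 0.7}
    present = {"audio": True, "visual": True, "text": False}  # text dropped out
    w = modality_weights(scores, present)
    feats = {"audio": [1.0, 0.0], "visual": [0.0, 1.0], "text": [0.0, 0.0]}
    print(w, fuse(feats, w))
```

Renormalizing over the modalities that are actually present keeps the weights summing to one when a stream drops out, which is one simple way to obtain the adaptive behavior the abstract describes.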

Paper Structure

This paper contains 20 sections, 2 theorems, 24 equations, 6 figures, 3 tables.

Key Result

Lemma 1

Let $\mathcal{D}=\{(\mathbf{x}^a_t,\mathbf{x}^v_t,\mathbf{x}^t_t)\}_{t=1}^T$, where $\mathbf{x}^a_t$, $\mathbf{x}^v_t$, and $\mathbf{x}^t_t$ denote the audio, visual, and textual feature vectors at time $t$, respectively, and at most one modality input is missing (set to $\mathbf{0}$) at each $t$. T defined by Edu-EmotionNet is Lipschitz continuous with respect to any one modality input when the o

Figures (6)

  • Figure 1: Online platforms lack real-time affective feedback. Edu-EmotionNet fills this gap.
  • Figure 2: Overview of Edu-EmotionNet’s end-to-end pipeline. Raw audio, visual, and text inputs are first encoded (Wav2Vec2 $\rightarrow$ Trans_A, ResNet $\rightarrow$ Trans_V, BERT $\rightarrow$ Trans_T), then aligned pairwise via Cross-Modality Attention Alignment (CMAA). A Modality Importance Estimator (MIE) computes confidence weights for each stream, producing a weighted fused feature $z_t$. This feature and the previous soft prediction $\hat{y}_{t-1}$ enter the Temporal Feedback Loop (TFL) to yield $\tilde{z}_t$, which is classified by an MLP + softmax into one of {Confused, Curious, Bored, Frustrated}. Training minimizes cross-entropy plus a KL term $\lambda\,\mathrm{KL}(\hat{y}_{t-1}\|\hat{y}_t)$.
  • Figure 3: Dynamic modality confidence weights over time with 95% confidence error bars. Visual remains dominant, while audio shows high variance under noise (steps 3–5).
  • Figure 4: Per-class F1 score comparison between the hybrid multi-attention fusion baseline (HybridFusion) and Edu-EmotionNet. Score labels are shifted above the bars for clarity.
  • Figure 5: Accuracy under increasing missing modality rates.
  • ...and 1 more figure
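
The training objective stated in the Figure 2 caption, cross-entropy plus $\lambda\,\mathrm{KL}(\hat{y}_{t-1}\|\hat{y}_t)$, can be sketched in plain Python as follows. The averaging over time steps, the default `lam=0.1`, the omission of the KL term at the first step, and the function names are assumptions of this sketch; the caption fixes only the two terms and the direction of the KL divergence.

```python
import math

EMOTIONS = ["Confused", "Curious", "Bored", "Frustrated"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-probability of the true class index."""
    return -math.log(probs[label] + eps)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def sequence_loss(logits_seq, labels, lam=0.1):
    """Mean over time of CE(y_t, yhat_t) + lam * KL(yhat_{t-1} || yhat_t).

    The KL term is skipped at the first step, where no previous prediction
    exists (an assumption of this sketch).
    """
    total = 0.0
    prev = None
    for logits, label in zip(logits_seq, labels):
        probs = softmax(logits)
        total += cross_entropy(probs, label)
        if prev is not None:
            total += lam * kl_divergence(prev, probs)
        prev = probs
    return total / len(labels)


if __name__ == "__main__":
    logits_seq = [[2.0, 0.1, 0.0, 0.0], [1.5, 0.3, 0.0, 0.2], [0.0, 0.0, 2.5, 0.1]]
    labels = [0, 0, 2]  # indices into EMOTIONS
    print(sequence_loss(logits_seq, labels, lam=0.1))
```

The KL term penalizes abrupt changes between consecutive soft predictions, so a larger `lam` trades responsiveness to genuine emotional transitions for smoother output.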

Theorems & Definitions (4)

  • Lemma 1
  • proof
  • Theorem 1
  • proof