Automatically Detecting Confusion and Conflict During Collaborative Learning Using Linguistic, Prosodic, and Facial Cues

Yingbo Ma; Yukyeong Song; Mehmet Celepkolu; Kristy Elizabeth Boyer; Eric Wiebe; Collin F. Lynch; Maya Israel

TL;DR

This paper tackles the challenge of automatically detecting confusion and conflict during collaborative learning by integrating linguistic, acoustic, and visual cues. It builds a multimodal framework that combines language semantics with prosodic and facial cues, embedding frame-level audio and video features with LongFormer and fusing the modalities with neural cross-attention mechanisms. The study, based on 38 elementary school students doing pair programming, shows that multimodal models outperform unimodal baselines, that prosodic cues are more predictive of conflict, and that facial cues are more predictive of confusion; the best-performing detector combines lexical, audio, and video features. The findings support real-time adaptive support systems in classroom contexts and highlight the value of neural fusion methods for modeling dependencies between modalities during collaboration.

Abstract

During collaborative learning, confusion and conflict emerge naturally. However, persistent confusion or conflict has the potential to generate frustration and significantly impede learners' performance. Early automatic detection of confusion and conflict would enable early interventions, which can in turn improve students' experience with and outcomes from collaborative learning. Despite extensive studies modeling confusion during solo learning, there is a need for further work in collaborative learning. This paper presents a multimodal machine-learning framework that automatically detects confusion and conflict during collaborative learning. We used data from 38 elementary school learners who collaborated on a series of programming tasks in classrooms. We trained deep multimodal learning models to detect confusion and conflict using features that were automatically extracted from learners' collaborative dialogues: (1) language-derived features, including TF-IDF, lexical semantics, and sentiment; (2) audio-derived features, including acoustic-prosodic features; and (3) video-derived features, including eye gaze, head pose, and facial expressions. Our results show that multimodal models that combine semantics, pitch, and facial expressions detected confusion and conflict with the highest accuracy, outperforming all unimodal models. We also found that prosodic cues are more predictive of conflict, and facial cues are more predictive of confusion. This study contributes to the automated modeling of collaborative learning processes and the development of real-time adaptive support to enhance learners' collaborative learning experience in classroom contexts.
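To make the feature combination in the abstract concrete, the following is a minimal sketch of one simple way to join the three feature groups: a TF-IDF vector per utterance, concatenated with audio and video feature vectors (plain early fusion). It is not the authors' implementation, which uses neural models. The function names `tfidf_vectors` and `early_fusion`, the smoothed IDF formula, the example utterances, and all audio and video values are assumptions made for illustration only.

```python
import math
from collections import Counter


def tfidf_vectors(utterances):
    """Return a sorted vocabulary and one TF-IDF vector per utterance."""
    tokenized = [u.lower().split() for u in utterances]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of utterances that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        # Smoothed IDF (an assumption) keeps weights positive for common tokens.
        vectors.append([
            (counts[tok] / total) * (math.log((1 + n_docs) / (1 + df[tok])) + 1.0)
            for tok in vocab
        ])
    return vocab, vectors


def early_fusion(lexical, audio, video):
    """Concatenate per-utterance feature vectors from the three modalities."""
    if not (len(lexical) == len(audio) == len(video)):
        raise ValueError("each modality needs exactly one vector per utterance")
    return [l + a + v for l, a, v in zip(lexical, audio, video)]


# Hypothetical example data; none of these values come from the paper's dataset.
utterances = ["wait why is it not moving", "no I want to do it my way"]
audio = [[180.0, 0.4], [240.0, 0.9]]        # e.g. mean pitch (Hz), normalized intensity
video = [[0.7, 0.1, 0.0], [0.2, 0.0, 0.3]]  # e.g. brow furrow, gaze offset, head yaw

vocab, lexical = tfidf_vectors(utterances)
fused = early_fusion(lexical, audio, video)
```

Each row of `fused` could then be passed to any classifier that labels the utterance as confusion, conflict, or neither. Concatenation is the simplest fusion choice and treats the modalities as independent inputs; it does not model the interactions between modalities that attention-based fusion is meant to capture.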

Paper Structure (25 sections, 7 figures, 7 tables)

Introduction
Background
Confusion and Conflict during Learning
Automatic Modeling of Confusion and Conflict
Dataset
Participants and Collaborative Setting
Data Collection
Manual Annotation of Confusion and Conflict
Automatic Speech Recognition (ASR)
Multimodal Features
Data Preprocessing
Language-derived Features
Audio-derived Features
Visual Features
Feature Postprocessing
...and 10 more sections

Figures (7)

Figure 1: The FLECKS learning environment, which embeds virtual learning companions (lower right) in the block-based coding environment.
Figure 2: Two elementary school learners collaborating on a pair programming task. In the captured moment, the learner on the left side of the frame is the navigator and the learner on the right is the driver; their collaborative interaction is video-recorded via Zoom with a front-facing camera and audio-recorded with each learner wearing a lavalier microphone.
Figure 3: A manually generated sample transcript excerpt.
Figure 4: Transformer-based feature embedding subnetwork.
Figure 5: Overview of the multimodal architecture of the confusion and conflict moments detection model with early fusion. The other multimodal models followed the same structure with a subset of the modalities.
...and 2 more figures
