Automated Assessment of Encouragement and Warmth in Classrooms Leveraging Multimodal Emotional Features and ChatGPT

Ruikun Hou, Tim Fütterer, Babette Bühler, Efe Bozkir, Peter Gerjets, Ulrich Trautwein, Enkelejda Kasneci

TL;DR

This study addresses the challenge of objectively and efficiently assessing Encouragement and Warmth (EW) in classroom settings, traditionally measured via manual human observation. It proposes a multimodal pipeline that combines facial emotion (valence/arousal and expressions), speech emotion, and transcript sentiment, along with a separate ChatGPT zero-shot pathway to rate EW from transcripts, and an ensemble that fuses both. The best supervised multimodal regressor achieves $r = 0.441$, while GPT-4 zero-shot attains $r = 0.341$; an averaging ensemble reaches $r = 0.513$, comparable to human inter-rater reliability. SHAP analyses reveal text sentiment features as the primary drivers, and GPT-4 can provide explicit reasoning for its scores, suggesting practical utility for scalable feedback and teacher development across multilingual classroom data.

Abstract

Classroom observation protocols standardize the assessment of teaching effectiveness and facilitate comprehension of classroom interactions. Whereas these protocols offer teachers specific feedback on their teaching practices, the manual coding by human raters is resource-intensive and often unreliable. This has sparked interest in developing AI-driven, cost-effective methods for automating such holistic coding. Our work explores a multimodal approach to automatically estimating encouragement and warmth in classrooms, a key component of the Global Teaching Insights (GTI) study's observation protocol. To this end, we employed facial and speech emotion recognition with sentiment analysis to extract interpretable features from video, audio, and transcript data. The prediction task involved both classification and regression methods. Additionally, in light of recent large language models' remarkable text annotation capabilities, we evaluated ChatGPT's zero-shot performance on this scoring task based on transcripts. We demonstrated our approach on the GTI dataset, comprising 367 16-minute video segments from 92 authentic lesson recordings. The inferences of GPT-4 and the best-trained model yielded correlations of r = .341 and r = .441 with human ratings, respectively. Combining estimates from both models through averaging, an ensemble approach achieved a correlation of r = .513, comparable to human inter-rater reliability. Our model explanation analysis indicated that text sentiment features were the primary contributors to the trained model's decisions. Moreover, GPT-4 could deliver logical and concrete reasoning as potential teacher guidelines. Our findings provide insights into using advanced, multimodal techniques for automated classroom observation, aiming to foster teacher training through frequent and valuable feedback.
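The averaging ensemble and the correlation metric mentioned in the abstract can be illustrated with a minimal sketch. It assumes the ensemble is a plain unweighted mean of the two per-segment estimates and that agreement with human ratings is measured with Pearson's r. The function names, the score range, and the noise levels are hypothetical, and the data are synthetic, so the printed values do not reproduce the reported correlations.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def average_ensemble(pred_a, pred_b):
    """Unweighted mean of two per-segment score estimates (assumed fusion rule)."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction lists must have the same length")
    return [(a + b) / 2.0 for a, b in zip(pred_a, pred_b)]


if __name__ == "__main__":
    # Synthetic stand-ins: hypothetical human ratings plus two noisy estimators
    # whose errors are independent of each other.
    rng = random.Random(0)
    human = [rng.uniform(1.0, 4.0) for _ in range(367)]
    trained_model = [h + rng.gauss(0.0, 1.5) for h in human]
    llm_zero_shot = [h + rng.gauss(0.0, 2.0) for h in human]
    ensemble = average_ensemble(trained_model, llm_zero_shot)

    print("trained model r:", round(pearson_r(human, trained_model), 3))
    print("LLM zero-shot r:", round(pearson_r(human, llm_zero_shot), 3))
    print("ensemble r:     ", round(pearson_r(human, ensemble), 3))
```

Averaging tends to help when the two estimators make largely independent errors, which is one plausible reading of why the combined estimate correlates more strongly with human ratings than either input alone.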

Paper Structure (18 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: Classroom frame from two cameras, with people erased for privacy.
  • Figure 2: Pipeline for multimodal estimation of EW scores.
  • Figure 3: Prompt for ChatGPT.
  • Figure 4: SHAP summary plot for MLP regressor. Points depict SHAP values per feature per data sample. Features are ranked by their importance (sum of SHAP value magnitudes over all samples). The 10 most influential features are shown.
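
The feature ranking described in the Figure 4 caption (importance as the sum of SHAP value magnitudes over all samples) can be sketched as follows. This is an illustration only: it assumes the SHAP values are already available as a samples-by-features table, and the function name, feature names, and numbers are hypothetical rather than taken from the paper.

```python
def rank_features_by_shap(shap_values, feature_names, top_k=10):
    """Rank features by the sum of absolute SHAP values over all samples.

    shap_values: list of rows, one per data sample, each holding one SHAP
    value per feature (same order as feature_names).
    Returns up to top_k (feature_name, importance) pairs, most important first.
    """
    importance = [0.0] * len(feature_names)
    for row in shap_values:
        if len(row) != len(feature_names):
            raise ValueError("each row needs one SHAP value per feature")
        for j, value in enumerate(row):
            importance[j] += abs(value)
    ranked = sorted(zip(feature_names, importance),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Hypothetical feature names and SHAP values for two samples.
    names = ["face_valence_mean", "text_sentiment_positive", "speech_arousal_mean"]
    values = [
        [0.1, -0.5, 0.0],
        [-0.2, 0.4, 0.1],
    ]
    for name, score in rank_features_by_shap(values, names):
        print(name, round(score, 3))
```

Taking absolute values before summing means that a feature pushing predictions strongly in both directions still ranks as influential, which matches the caption's use of magnitudes.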