Enhancing Public Speaking Skills in Engineering Students Through AI

Amol Harsh, Brainerd Prince, Siddharth Siddharth, Deepan Raj Prabakar Muthirayan, Kabir S Bhalla, Esraaj Sarkar Gupta, Siddharth Sahu

TL;DR

This paper tackles the challenge of scalable, personalized public-speaking training for engineering students by proposing SapienAI, a multi-modal LLM-based evaluator that fuses verbal, non-verbal, and emotional cues to assess expressive coherence. It introduces a 12-item Public Speaking Competence Rubric extended with dynamic emphasis and emotional resonance, and builds a data pipeline that combines transcripts, vocal features, facial expressions, and gestures, which LLMs then evaluate via prompts. Benchmarking four state-of-the-art LLMs against human raters (20 participants) shows Gemini 1.5 Pro achieving the best overall alignment (mean agreement score of 0.41) and robust performance on multi-modal rubrics, indicating potential to replace or augment human evaluators. The work highlights the promise and challenges of automated, scalable public-speaking feedback, and outlines future directions including cross-cultural validity and physiological data integration.

Abstract

This research-to-practice full paper responds to the persistent challenge of effective communication among engineering students. Public speaking is a necessary skill for future engineers because they must communicate technical knowledge to diverse stakeholders. While universities offer courses and workshops, they are unable to offer sustained, personalized training to students. Providing comprehensive feedback on both verbal and non-verbal aspects of public speaking is time-intensive, which makes consistent, individualized assessment impractical. This study integrates research on verbal and non-verbal cues in public speaking to develop an AI-driven assessment model for engineering students. Our approach combines speech analysis, computer vision, and sentiment detection into a multi-modal AI system that provides assessment and feedback. The model evaluates (1) verbal communication (pitch, loudness, pacing, intonation), (2) non-verbal communication (facial expressions, gestures, posture), and (3) expressive coherence, a novel integrated measure of the alignment between speech and body language. Unlike previous systems that assess these aspects separately, our model fuses multiple modalities to deliver personalized, scalable feedback. Preliminary testing showed that our AI-generated feedback was moderately aligned with expert evaluations. Among the state-of-the-art AI models evaluated, all of which were Large Language Models (LLMs), including Gemini and OpenAI models, Gemini Pro performed best, showing the strongest agreement with human annotators. By eliminating reliance on human evaluators, this AI-driven public speaking trainer enables repeated practice, helping students naturally align their speech with body language and emotion, which is crucial for impactful, professional communication.

Paper Structure

This paper contains 10 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Overview of the AI-powered public speaking evaluation pipeline.
  • Figure 2: An example snippet of rich multi-modal data combining text, vocal, and non-verbal features.
  • Figure 3: Wrist Tracking Over Time (X Direction)
  • Figure 4: LLM Models vs Human Ground Truth (Kappa Scores)
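Figure 4 reports agreement between LLM and human ratings as kappa scores. This listing does not say which kappa variant the authors computed, so the following is only a minimal sketch that assumes unweighted Cohen's kappa between one human rater and one model. The function name `cohen_kappa` and the sample ratings are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of category labels.

    Assumption: this is one common agreement statistic; the paper may have
    used a different variant (for example, a weighted kappa).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters need the same non-zero number of ratings")

    n = len(rater_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability that both raters pick the same label
    # if each labelled independently according to their own label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical rubric scores (1-5) for six speeches; not data from the paper.
human_scores = [3, 4, 2, 5, 3, 4]
model_scores = [3, 4, 3, 5, 2, 4]

kappa = cohen_kappa(human_scores, model_scores)
print(f"kappa = {kappa:.3f}")
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is why a mean near 0.41 is commonly read as moderate agreement.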