
The Moment of Capture: How the First Seconds of a Speaker's Nonverbal and Verbal Performance Shapes Audience Judgments

Ralf Schmälzle, Yuetong Du, Sue Lim, Gary Bente

Abstract

Why do some speakers capture a room almost instantly while others fail to connect? The real-time architecture of audience engagement remains largely a black box. Here, we used motion-captured animations to present the pure nonverbal performance of public speakers to audiences, either in silence (nonverbal-only) or paired with the verbal content (nonverbal-plus-verbal). Using continuous response measurement (CRM), we find that audience judgments solidify with remarkable speed: Moment-to-moment engagement ratings become highly predictive of subsequent evaluations within the initial 10 seconds of the performance. Most notably, this predictive relationship emerged faster and was slightly stronger in the nonverbal-only condition, with predictive information already present after less than 5 seconds. These findings elucidate the social impact a speaker's nonverbal performance has on audience impressions, even when dissociated from the verbal content of the speech. Our approach provides a high-resolution temporal map of social impression formation, pointing to an early "moment of capture" that appears to set the stage for the reception of the following message. On a broader scale, this research validates a powerful new method to isolate different communicative channels, to scientifically deconstruct rhetorical skill, and to study the pervasive impact of nonverbal behavior. It also enables us to translate the ancient art of rhetoric into a modern science of social impression formation, yielding an empirical basis that can inform human-centered feedback, develop AI-based augmentation tools, and guide the design of engaging, socially present avatars in an increasingly AI-mediated and virtual world.

Paper Structure (21 sections, 3 figures)

This paper contains 21 sections, 3 figures.

Figures (3)

  • Figure 1: The Current Study: Stimuli, Conditions, and Analysis Overview. Top panel: Illustration of stimulus creation based on a real-life public speaking corpus, using motion-animation and voice-neutralization methods. A rating study was conducted to annotate these stimuli under either nonverbal-only or nonverbal-plus-verbal viewing conditions. Middle panel: Details on evaluation procedures: Nonverbal-only and nonverbal-plus-verbal performances were shown to test audiences, who were asked to i) continuously evaluate engagingness and ii) provide retrospective summary evaluations (including engagingness as well as speaker and speech impressions). Bottom panel: Analysis rationale: The retrospective summary evaluations (collected after the performance) serve as the outcome criterion, and the continuously collected (CRM) ratings as the predictor. We then compute correlations between the moment-by-moment predictor values (CRM ratings at point1, point2, etc.) and the outcomes (final ratings). This enables us to map the temporal dynamics of impression formation and test at which point correlations are significant. Furthermore, we can disentangle the relative contribution of the nonverbal delivery from that of the verbal message.
  • Figure 2: Correlations between CRM-Ratings and Subsequent Evaluations. The top panels show the correlations between moment-by-moment CRM evaluations and the corresponding outcome evaluations for each speech – separated according to condition (red: nonverbal-only; blue: nonverbal-plus-verbal) and the CRM time points (0, 10, 30, and 60 seconds). The large panel shows the aggregated results (CRM-Engagement-Correlation curves) across all time points and for both conditions. As can be seen, the correlations between in-the-moment CRM ratings and final outcome ratings rise quickly, reaching the significance threshold (red line) already after a few seconds. Shaded areas represent the 95%-confidence interval around the measured correlations for each time point. The small panels on the bottom show the positive relationship between nonverbal-only and nonverbal+verbal outcome engagement ratings (top) as well as the positive relationship between the nonverbal-only ratings and LLM-based evaluations of the textual transcripts of the same speeches (external to this study).
  • Figure 3: Supplementary Figure 1: Screenshots from nine different speeches, illustrating the nature of the nonverbal avatar animations. Each video was 60 sec long and was generated from motion capture recordings of real-life speakers presenting about their work, which were then re-rendered onto neutral avatar figures. This approach perfectly controls stereotypical information but preserves the continuous and natural flow of nonverbal behavior. Half of the observers watched and evaluated these silent animations; the other half watched the same animations accompanied by the corresponding verbal performance (also transformed in an analogous manner to neutralize vocal cues).
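
The analysis rationale in the Figure 1 caption (correlating the CRM rating at each time point with the final rating) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names, the toy numbers, the choice of one row per unit of analysis, and the threshold value are all hypothetical, and the paper's actual significance test is not reproduced here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined without variance
    return sxy / math.sqrt(sxx * syy)


def crm_outcome_curve(crm, outcome):
    """Correlate the CRM ratings at each time point with the final ratings.

    crm[i][t] is the CRM rating of unit i at time point t (assumed layout);
    outcome[i] is the retrospective rating of the same unit.
    """
    n_time = len(crm[0])
    return [pearson_r([row[t] for row in crm], outcome) for t in range(n_time)]


def first_crossing(curve, threshold):
    """Index of the first time point whose correlation reaches the threshold, else None."""
    for t, r in enumerate(curve):
        if r >= threshold:  # NaN compares False, so undefined points are skipped
            return t
    return None


# Hypothetical toy data: 4 units, 3 time points. All sliders start at the same
# value, so the correlation at the first time point is undefined.
toy_crm = [
    [2, 1, 1],
    [2, 3, 2],
    [2, 2, 3],
    [2, 4, 4],
]
toy_outcome = [1, 2, 3, 4]

curve = crm_outcome_curve(toy_crm, toy_outcome)
print(curve)
print(first_crossing(curve, threshold=0.9))  # threshold is an arbitrary placeholder
```

Plotting `curve` against time would give a curve of the kind described for the large panel of Figure 2. Running the sketch separately on nonverbal-only and nonverbal-plus-verbal ratings would allow the two conditions to be compared.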