Multimodal Machine Learning Can Predict Videoconference Fluidity and Enjoyment
Andrew Chang, Viswadruth Akkaraju, Ray McFadden Cogliano, David Poeppel, Dustin Freeman
TL;DR
This work investigates whether multimodal signals from short clips can predict subjective videoconference experience. The authors extract audio embeddings (VGGish, YAMNet, Wav2Vec2), facial action units, and GC-based body-motion features from RoomReader clips, then train a cross-session logistic regression model with PCA to predict fluidity, enjoyment, and conversational event type. Combining modalities yields the strongest predictions (e.g., ROC-AUC up to 0.874 for enjoyment and 0.867 for event classification), domain-general audio features drive the high-level outcomes, and pre-event facial cues retain predictive power; the authors also demonstrate generalization across fluidity and enjoyment. These findings suggest that multimodal ML can enable bulk, automated analysis and potential in-session interventions to mitigate negative conversational experiences, though generalizability beyond the RoomReader corpus remains a limitation.
Abstract
Videoconferencing is now a frequent mode of communication in both professional and informal settings, yet it often lacks the fluidity and enjoyment of in-person conversation. This study leverages multimodal machine learning to predict moments of negative experience in videoconferencing. We sampled thousands of short clips from the RoomReader corpus and extracted audio embeddings, facial actions, and body motion features to train models that identify low conversational fluidity, identify low enjoyment, and classify conversational events (backchanneling, interruption, or gap). Our best models achieved an ROC-AUC of up to 0.87 on hold-out videoconference sessions, with domain-general audio features proving most critical. This work demonstrates that multimodal audio-video signals can effectively predict high-level subjective conversational outcomes. It also contributes to research on videoconferencing user experience by showing that multimodal machine learning can identify rare moments of negative user experience for further study or mitigation.
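To make the evaluation setup concrete, the sketch below illustrates what scoring on hold-out sessions with ROC-AUC might look like. It is not the authors' code: the function names, session IDs, labels, and scores are all hypothetical toy values, and the model-fitting step (PCA followed by logistic regression) is omitted, so `scores` simply stands in for a classifier's predicted probabilities.

```python
def leave_sessions_out(session_ids, held_out):
    """Split clip indices so that no session contributes clips to both sides."""
    held_out = set(held_out)
    train_idx = [i for i, s in enumerate(session_ids) if s not in held_out]
    test_idx = [i for i, s in enumerate(session_ids) if s in held_out]
    return train_idx, test_idx


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive clip is scored
    above a random negative clip (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six clips from three sessions.
session_ids = ["s1", "s1", "s2", "s2", "s3", "s3"]
labels = [1, 0, 1, 0, 1, 0]               # 1 = low-enjoyment clip (made up)
scores = [0.9, 0.2, 0.4, 0.6, 0.8, 0.3]   # stand-in for model probabilities

train_idx, test_idx = leave_sessions_out(session_ids, {"s3"})
held_out_auc = roc_auc([labels[i] for i in test_idx],
                       [scores[i] for i in test_idx])
```

Splitting by session rather than by clip keeps clips from the same conversation out of both the training and test sets, which is one common way to avoid overstating how well a model transfers to unseen sessions.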
