Supporting Experts with a Multimodal Machine-Learning-Based Tool for Human Behavior Analysis of Conversational Videos
Riku Arakawa, Kiyosu Maeda, Hiromu Yakura
TL;DR
This paper introduces Providence, a multimodal scene search tool for conversational videos that uses a visual programming interface to let domain experts combine linguistic, para-linguistic, and nonverbal features without coding. Grounded in a formative study with eight experts, Providence emphasizes customizability, transparency, and reusability, and is complemented by a knowledge-sharing repository that enables collaborative knowledge accumulation. In a 12-participant user study, Providence shows favorable usability and reduced cognitive load, with trends toward faster task completion and more objective, reusable analyses. In-the-wild deployments with 11 experts show that the tool reshapes workflows by enabling cross-video comparisons and shared queries, while highlighting avenues for feature expansion and semi-automated query generation to broaden applicability and maintain human-centered control.
Abstract
Multimodal scene search of conversations is essential for unlocking valuable insights into social dynamics and enhancing our communication. While experts in conversational analysis have their own knowledge and skills to find key scenes, a lack of comprehensive, user-friendly tools that streamline the processing of diverse multimodal queries impedes efficiency and objectivity. To address this, we developed Providence, a visual-programming-based tool built on design considerations derived from a formative study with experts. It enables experts to combine various machine learning algorithms to capture human behavioral cues without writing code. Our study showed favorable usability and satisfactory output, with less cognitive load imposed in accomplishing conversational scene search tasks, verifying the importance of its customizability and transparency. Furthermore, through an in-the-wild trial, we confirmed that the objectivity and reusability of the tool transform experts' workflow, suggesting the advantage of expert-AI teaming in a highly human-contextual domain.
