Supporting Experts with a Multimodal Machine-Learning-Based Tool for Human Behavior Analysis of Conversational Videos

Riku Arakawa, Kiyosu Maeda, Hiromu Yakura

TL;DR

This paper introduces Providence, a multimodal scene search tool for conversational videos that uses a visual programming interface to let domain experts combine linguistic, para-linguistic, and nonverbal features without coding. Grounded in a formative study with eight experts, Providence emphasizes customizability, transparency, and reusability, and is complemented by a knowledge-share repository that enables collaborative knowledge accumulation. In a 12-participant user study, Providence shows favorable usability and reduced cognitive load, with trends toward faster task completion and more objective, reusable analyses. In an in-the-wild deployment with 11 experts, the tool reshapes workflows by enabling cross-video comparisons and shared queries, while highlighting avenues for feature expansion and semi-automated query generation to broaden applicability and maintain human-centered control.

Abstract

Multimodal scene search of conversations is essential for unlocking valuable insights into social dynamics and enhancing our communication. While experts in conversational analysis have their own knowledge and skills to find key scenes, a lack of comprehensive, user-friendly tools that streamline the processing of diverse multimodal queries impedes efficiency and objectivity. To address this, we developed Providence, a visual-programming-based tool built on design considerations derived from a formative study with experts. It enables experts to combine various machine learning algorithms to capture human behavioral cues without writing code. Our study showed favorable usability and satisfactory output with less cognitive load in accomplishing scene search tasks on conversations, verifying the importance of its customizability and transparency. Furthermore, through an in-the-wild trial, we confirmed that the objectivity and reusability of the tool transform experts' workflow, suggesting the advantage of expert-AI teaming in a highly human-contextual domain.

Paper Structure (48 sections, 6 figures, 6 tables)

This paper contains 48 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Interface of Providence with an example query that detects scenes in which a person is asking a question but, according to their gaze, is not looking straight at the screen. Some parts (e.g., attendees' faces) are anonymized for blind review.
  • Figure 2: Architecture of Providence, which consists of (A) a programmatic framework for querying conversational videos, (B) a frontend interface with visual programming and feature visualization, and (C) a backend server for machine-learning algorithms and query processing.
  • Figure 3: Example code for searching scenes with Providence's programmatic framework. In this example, two features (i.e., nod count and voice activity) are combined. The input format is flexible enough to accept videos recorded on common video-conferencing platforms while the output format stays consistent.
  • Figure 4: Participants' evaluation of their cognitive load in the user study. $p < .05$ is marked as *.
  • Figure 5: Knowledge-share repository of Providence where users can contribute and access queries.
  • ...and 1 more figure
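
The caption of Figure 3 describes combining two features (nod count and voice activity) to search for scenes. The paper's actual code is not reproduced here, so the following is only a minimal sketch of the general idea, assuming each feature is a per-second time series. All names (`above`, `both`, `to_scenes`), the thresholds, and the sample values are hypothetical and are not Providence's API or data.

```python
from typing import List, Tuple


def above(values: List[float], threshold: float) -> List[bool]:
    """Mark each time step where a feature value reaches the threshold."""
    return [v >= threshold for v in values]


def both(a: List[bool], b: List[bool]) -> List[bool]:
    """Combine two per-step masks with a logical AND."""
    if len(a) != len(b):
        raise ValueError("feature series must have the same length")
    return [x and y for x, y in zip(a, b)]


def to_scenes(mask: List[bool], step_sec: float = 1.0) -> List[Tuple[float, float]]:
    """Turn a per-step mask into (start, end) intervals in seconds."""
    scenes = []
    start = None
    for i, active in enumerate(mask):
        if active and start is None:
            start = i  # a scene begins
        elif not active and start is not None:
            scenes.append((start * step_sec, i * step_sec))  # a scene ends
            start = None
    if start is not None:  # close a scene that runs to the end of the video
        scenes.append((start * step_sec, len(mask) * step_sec))
    return scenes


# Illustrative, made-up feature values sampled once per second.
nod_count = [0, 1, 2, 2, 0, 0, 1, 1]
voice_activity = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.6, 0.9]

# Scenes where the person nods at least once while speaking (hypothetical thresholds).
mask = both(above(nod_count, 1), above(voice_activity, 0.5))
scenes = to_scenes(mask)
print(scenes)
```

Keeping every feature as a series on a shared time axis is what makes the combination step a simple element-wise operation, and the same pattern extends to other operators (OR, NOT) or further features.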