Identification of Conversation Partners from Egocentric Video
Tobias Dorszewski, Søren A. Fuglsang, Jens Hjortkjær
TL;DR
This work defines the problem of identifying conversation partners for a camera wearer from egocentric video in multi-conversation settings and presents a new, richly annotated dataset (68.9 hours, 6.2 million frames, 877 clips) with ground-truth partner labels. It formulates a per-face, frame-level binary classification task evaluated by average precision (AP) and provides initial baselines based on simple visual cues (center proximity, face size, detection confidence, and face-recognition similarity). Results show AP above 0.65–0.7 on various subsets, but performance is highly sensitive to conversation context and group size, underscoring generalization challenges and the value of temporal information. The paper lays groundwork for egocentric social-partner analysis and envisions downstream applications in beamforming and selective speech separation for hearing aids, while suggesting future work on gaze, long-term context, and integration with existing Ego4D/SAAL signals.
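To make the task formulation concrete, the following is a minimal sketch of how a center-proximity baseline could be scored with average precision over per-face, frame-level labels. The `Face` record, the score normalization, the frame size, and the toy data are all assumptions made for illustration; the paper's actual features, data format, and evaluation code may differ.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Face:
    """One detected face in one frame (hypothetical record layout)."""
    cx: float          # bounding-box center x, in pixels
    cy: float          # bounding-box center y, in pixels
    is_partner: bool   # ground-truth conversation-partner label


def center_proximity_score(face, frame_w, frame_h):
    """Score in [0, 1]: 1 at the frame center, 0 at a frame corner."""
    half_diag = hypot(frame_w / 2, frame_h / 2)
    dist = hypot(face.cx - frame_w / 2, face.cy - frame_h / 2)
    return 1.0 - dist / half_diag


def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of each positive.

    Ties in score are broken by input order (an assumption of this sketch).
    """
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y)
    if n_pos == 0:
        raise ValueError("AP is undefined without positive labels")
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy faces pooled over frames of an assumed 1920x1080 video.
faces = [
    Face(960, 540, True),
    Face(200, 300, False),
    Face(1100, 600, True),
    Face(1000, 500, False),
    Face(1700, 900, False),
]
scores = [center_proximity_score(f, 1920, 1080) for f in faces]
labels = [f.is_partner for f in faces]
ap = average_precision(scores, labels)
print(f"AP = {ap:.3f}")
```

In this toy example one non-partner face sits closer to the frame center than one of the partners, so the ranking is imperfect and AP falls below 1. The other cues named above (face size, detection confidence, face-recognition similarity) could be plugged in by swapping the scoring function.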
Abstract
Communicating in noisy, multi-talker environments is challenging, especially for people with hearing impairments. Egocentric video data can potentially be used to identify a user's conversation partners, which in turn could inform selective acoustic amplification of relevant speakers. The recent introduction of datasets and tasks in computer vision enables progress towards analyzing social interactions from an egocentric perspective. Building on this, we focus on the task of identifying conversation partners from egocentric video and describe a suitable dataset. Our dataset comprises 69 hours of egocentric video of diverse multi-conversation scenarios in which each individual was assigned one or more conversation partners, providing the labels for our computer vision task. This dataset enables the development and assessment of algorithms for identifying conversation partners and the evaluation of related approaches. Here, we describe the dataset alongside initial baseline results of this ongoing work, aiming to contribute to the exciting advancements in egocentric video analysis for social settings.
