SocialPulse: On-Device Detection of Social Interactions in Naturalistic Settings Using Smartwatch Multimodal Sensing

Md Sabbir Ahmed, Kaitlyn Dorothy Petz, Noah French, Tanvi Lakhtakia, Aayushi Sangani, Mark Rucker, Xinyu Chen, Bethany A. Teachman, Laura E. Barnes

TL;DR

This work demonstrates the feasibility of real-world interaction sensing and opens the door to adaptive, context-aware systems that respond to users' dynamic social environments.

Abstract

Social interactions are fundamental to well-being, yet automatically detecting them in daily life, particularly using wearables, remains underexplored. Most existing systems are evaluated in controlled settings, focus primarily on in-person interactions, or rely on restrictive assumptions (e.g., requiring multiple speakers within fixed temporal windows), limiting generalizability to real-world use. We present an on-watch interaction detection system designed to capture diverse interactions in naturalistic settings. A core component is a foreground speech detector trained on a public dataset. Evaluated on over 100,000 labeled foreground speech and background sound instances, the detector achieves a balanced accuracy of 85.51%, outperforming prior work by 5.11%. We evaluated the system in a real-world deployment (N=38) with over 900 hours of total smartwatch wear time. The system detected 1,691 interactions, of which 77.28% were confirmed via participant self-report, with durations ranging from under one minute to over one hour. Among correct detections, 81.45% were in-person, 15.7% virtual, and 1.85% hybrid. Leveraging participant-labeled data, we further developed a multimodal model achieving a balanced accuracy of 90.36% and a sensitivity of 91.17% on 33,698 labeled 15-second windows. These results demonstrate the feasibility of real-world interaction sensing and open the door to adaptive, context-aware systems responding to users' dynamic social environments.
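The abstract reports balanced accuracy and sensitivity. As a reading aid, the sketch below shows how these two metrics are conventionally computed for a binary classifier, where balanced accuracy is the mean of sensitivity and specificity. The function name and the confusion-matrix counts are hypothetical and are not taken from the paper; the paper's exact evaluation protocol (e.g., any per-participant averaging) is not described here and may differ.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float, float]:
    """Return (balanced_accuracy, sensitivity, specificity) for a binary classifier.

    tp/fn: positive windows (e.g., interaction) predicted correctly / missed.
    tn/fp: negative windows predicted correctly / falsely flagged.
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return (sensitivity + specificity) / 2, sensitivity, specificity


# Hypothetical counts, chosen only to illustrate an imbalanced dataset.
ba, sens, spec = balanced_accuracy(tp=90, fn=10, tn=160, fp=40)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} balanced_accuracy={ba:.2f}")
```

Because each class contributes equally to the mean, balanced accuracy is less affected by class imbalance than plain accuracy, which is one common reason to report it for detection tasks where negatives dominate.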

Paper Structure (49 sections, 14 figures, 6 tables)

Figures (14)

  • Figure 1: Difference between foreground and background sound. (a) Visualization of embeddings for 50,000 audio frames (25,000 per class). For readability, only this subset of the 227,640 frames is shown. (b) Predicted foreground speech probabilities for all frames. Homogeneous blocks indicate audio instances in which both sub-frames have predicted probabilities either above or below 50%, whereas heterogeneous blocks indicate instances in which the two sub-frames have opposite predictions (e.g., one sub-frame exceeds the 50% threshold while the other falls below it).
  • Figure 2: The pipeline for on-watch foreground speech prediction. $P_{frame1}$ and $P_{frame2}$ denote the probabilities of foreground speech for frame 1 and frame 2, respectively.
  • Figure 3: On-watch SocialPulse pipeline for automatic social interaction detection.
  • Figure 4: SocialPulse feedback mechanisms for automatically detected social interactions: (a) real-time notification for marking whether interaction occurred and (b) a more detailed interface for annotating interactions.
  • Figure 5: Two features of SocialPulse: (a) editing a detected interaction and (b) responding to notifications about missed interactions.
  • ...and 9 more figures
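
The homogeneous/heterogeneous distinction in the Figure 1 caption can be sketched as a small function over the two sub-frame probabilities (the $P_{frame1}$ and $P_{frame2}$ of Figure 2). This is an illustration of the caption's definition only, not the authors' implementation: the function name is hypothetical, and treating a probability of exactly 50% as "not above" is an assumption the caption does not settle. How the paper combines the two probabilities into a final instance-level prediction is not stated in this excerpt and is not modeled here.

```python
THRESHOLD = 0.5  # the 50% threshold named in the Figure 1 caption


def block_type(p_frame1: float, p_frame2: float, threshold: float = THRESHOLD) -> str:
    """Label an audio instance by whether its two sub-frames agree.

    "homogeneous": both foreground-speech probabilities fall on the same
    side of the threshold; "heterogeneous": they fall on opposite sides.
    Assumption: a probability exactly equal to the threshold counts as below.
    """
    above1 = p_frame1 > threshold
    above2 = p_frame2 > threshold
    return "homogeneous" if above1 == above2 else "heterogeneous"


# Hypothetical probabilities for three audio instances.
for p1, p2 in [(0.91, 0.84), (0.12, 0.30), (0.72, 0.35)]:
    print(p1, p2, block_type(p1, p2))
```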