IHearYou: Linking Acoustic Features to DSM-5 Depressive Behavior Indicators
Jonas Länzlinger, Katharina Müller, Burkhard Stiller, Bruno Rodrigues
TL;DR
This work addresses the need for objective, interpretable, and privacy-preserving depression assessment by linking acoustic speech features to DSM-5 depressive indicators through an on-device Linkage Framework. The IHearYou system maps low-level voice metrics to clinically meaningful indicators through transparent, testable rules, producing explainable DSM-5 indicator scores without cloud processing. A configuration-driven protocol (including FDR control and gender stratification) yields reproducible results on the DAIC-WOZ dataset, and a TESS streaming experiment validates end-to-end feasibility, all on commodity hardware. Directional feature–indicator associations emerge, but the study notes that sample size limits these findings and calls for larger longitudinal, multimodal deployments that improve robustness and clinical utility while preserving edge privacy and interpretability.
Abstract
Depression affects millions of people worldwide, yet diagnosis still relies on subjective self-reports and interviews that may not capture authentic behavior. We present IHearYou, an approach to automated depression detection focused on speech acoustics. Using passive sensing in household environments, IHearYou extracts voice features and links them to DSM-5 (Diagnostic and Statistical Manual of Mental Disorders) indicators through a structured Linkage Framework instantiated for Major Depressive Disorder. The system runs locally to preserve privacy, includes a persistence schema and dashboard, and demonstrates real-time throughput on a commodity laptop. To ensure reproducibility, we define a configuration-driven protocol with False Discovery Rate (FDR) correction and gender-stratified testing. Applied to the DAIC-WOZ dataset, this protocol reveals directionally consistent feature-indicator associations, while a TESS-based audio streaming experiment validates end-to-end feasibility. Our results show how passive voice sensing can be turned into explainable DSM-5 indicator scores, bridging the gap between black-box detection and clinically interpretable, on-device analysis.
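
As an illustration of the FDR correction with gender-stratified testing mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes the Benjamini-Hochberg step-up procedure (the abstract does not name the specific FDR method), and the function name, stratum labels, and p-values are hypothetical.

```python
from typing import Dict, List


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Return one reject flag per hypothesis under Benjamini-Hochberg FDR control.

    Assumption: BH is used here only as a common FDR procedure; the source
    does not state which correction the protocol applies.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis whose rank is at most max_k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            reject[idx] = True
    return reject


# Hypothetical p-values for four feature-indicator tests, one list per stratum.
strata: Dict[str, List[float]] = {
    "female": [0.001, 0.04, 0.03, 0.20],
    "male": [0.012, 0.50, 0.008, 0.049],
}

# The correction is applied separately within each gender stratum.
results = {gender: benjamini_hochberg(p) for gender, p in strata.items()}
```

Correcting within each stratum keeps the set of tests being controlled explicit; whether the protocol corrects per stratum or across all tests jointly is not specified in the abstract, so this choice is also an assumption.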
