Subjective quality evaluation of personalized own voice reconstruction systems
Mattes Ohlenbusch, Christian Rollwage, Simon Doclo, Jan Rennies
TL;DR
This work evaluates personalized own voice reconstruction (OVR) for hearables by comparing generic systems with talker-specific systems that are personalized through data augmentation and fine-tuning. Using a multi-microphone setup (outer and in-ear), the authors model the signals, implement FT-JNF-based OVR variants, and assess performance with both instrumental metrics and a MUSHRA-style listening test. Results show consistent subjective gains from OVR over baselines, with fine-tuned personalization delivering the strongest improvements, though the gains do not hold for all talkers. The study also finds that many objective metrics do not reliably predict subjective quality for bandwidth-limited, body-conducted speech, although ESTOI and LEAP align more closely with listener ratings than the other metrics. Metrics used to evaluate OVR systems and personalized enhancement strategies therefore need to be selected with care.
Abstract
Own voice pickup technology for hearable devices facilitates communication in noisy environments. Own voice reconstruction (OVR) systems enhance the quality and intelligibility of the recorded noisy own voice signals. Since disturbances affecting the recorded own voice signals depend on individual factors, personalized OVR systems have the potential to outperform generic OVR systems. In this paper, we propose personalizing OVR systems through data augmentation and fine-tuning, comparing them to their generic counterparts. We investigate the influence of personalization on speech quality assessed by objective metrics and conduct a subjective listening test to evaluate quality under various conditions. In addition, we assess the prediction accuracy of the objective metrics by comparing predicted quality with subjectively measured quality. Our findings suggest that personalized OVR provides benefits over generic OVR for some talkers only. Our results also indicate that performance comparisons between systems are not always accurately predicted by objective metrics. In particular, certain disturbances lead to a consistent overestimation of quality compared to actual subjective ratings.
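As an illustration of what comparing predicted quality with subjectively measured quality can look like in practice, the following minimal sketch computes a Spearman rank correlation between per-condition listening test scores and an objective metric. This is an assumption about one common way to quantify prediction accuracy, not necessarily the analysis used in the paper. The function names (`average_ranks`, `pearson`, `spearman`) are hypothetical, and the numbers are placeholders chosen for the example, not results from the study.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the group while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder values for illustration only (NOT data from the paper):
# one mean listening test score and one metric value per condition.
subjective_scores = [22.0, 41.0, 58.0, 63.0, 80.0]
metric_values = [0.35, 0.52, 0.50, 0.71, 0.88]

rho = spearman(subjective_scores, metric_values)
print(f"Spearman rank correlation: {rho:.2f}")
```

A rank correlation only checks whether a metric orders the conditions in the same way as listeners do. It would not reveal a consistent overestimation of quality for a particular disturbance, which would need a separate comparison of predicted and rated values on a common scale.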
