Subjective quality evaluation of personalized own voice reconstruction systems

Mattes Ohlenbusch, Christian Rollwage, Simon Doclo, Jan Rennies

TL;DR

This work evaluates personalized own voice reconstruction (OVR) for hearables by comparing generic and talker-specific systems. These systems are trained with generic or personalized data augmentation (DA), followed by generic or personalized fine-tuning (FT). Using a hearable with an outer and an in-ear microphone, the authors model the recorded signals, implement OVR variants based on FT-JNF, and assess performance with instrumental metrics and a MUSHRA-style listening test. Results show consistent subjective gains from OVR over the baselines, with personalized fine-tuning delivering the largest improvements, although the benefit of personalization is not consistent across talkers. The study also shows that many objective metrics do not reliably predict subjective quality for bandwidth-limited, body-conducted speech, while ESTOI and LEAP align relatively better with the listening-test ratings. This highlights the need for careful metric selection when evaluating OVR systems and personalizable enhancement strategies.
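The difference between generic and personalized data augmentation can be pictured with a toy simulation. The sketch below is an assumption for illustration, not the authors' signal model. It treats the clean own voice at the outer microphone as the training target and simulates the in-ear own voice by convolving it with an impulse response, which would be estimated for the target talker in personalized DA or pooled over talkers in generic DA. It then adds noise to both channels, with an assumed extra noise attenuation at the in-ear microphone. The names `augment_pair`, `h_in_ear` and `in_ear_attenuation_db` are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def convolve(x: Sequence[float], h: Sequence[float]) -> List[float]:
    """Full linear convolution of signal x with impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def power(x: Sequence[float]) -> float:
    """Mean power of a signal."""
    return sum(v * v for v in x) / len(x)


def scale_to_snr(signal: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Scale noise so that 10*log10(power(signal) / power(scaled noise)) == snr_db."""
    p_noise = power(noise)
    if p_noise == 0.0:
        raise ValueError("noise must not be all zeros")
    gain = math.sqrt(power(signal) / (p_noise * 10 ** (snr_db / 10)))
    return [gain * v for v in noise]


def augment_pair(
    speech: Sequence[float],
    h_in_ear: Sequence[float],
    noise_outer: Sequence[float],
    noise_in_ear: Sequence[float],
    snr_db: float,
    in_ear_attenuation_db: float,
) -> Tuple[List[float], List[float], List[float]]:
    """Hypothetical DA step returning (outer_mic, in_ear_mic, target).

    h_in_ear: assumed impulse response from outer to in-ear own voice
    (per talker for personalized DA, pooled over talkers for generic DA).
    The in-ear SNR is set to snr_db + in_ear_attenuation_db, an assumed
    simplification of passive noise attenuation by the earpiece.
    """
    n = len(speech)
    in_ear_clean = convolve(speech, h_in_ear)[:n]  # truncate to input length
    outer_noise = scale_to_snr(speech, noise_outer[:n], snr_db)
    outer = [s + v for s, v in zip(speech, outer_noise)]
    in_ear_noise = scale_to_snr(in_ear_clean, noise_in_ear[:n], snr_db + in_ear_attenuation_db)
    in_ear = [s + v for s, v in zip(in_ear_clean, in_ear_noise)]
    return outer, in_ear, list(speech)  # target: clean outer-mic own voice (assumed)
```

In practice one would call `augment_pair` many times per clean utterance with different noise excerpts and SNRs, swapping only `h_in_ear` to switch between generic and personalized augmentation.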

Abstract

Own voice pickup technology for hearable devices facilitates communication in noisy environments. Own voice reconstruction (OVR) systems enhance the quality and intelligibility of the recorded noisy own voice signals. Since disturbances affecting the recorded own voice signals depend on individual factors, personalized OVR systems have the potential to outperform generic OVR systems. In this paper, we propose personalizing OVR systems through data augmentation and fine-tuning, comparing them to their generic counterparts. We investigate the influence of personalization on speech quality assessed by objective metrics and conduct a subjective listening test to evaluate quality under various conditions. In addition, we assess the prediction accuracy of the objective metrics by comparing predicted quality with subjectively measured quality. Our findings suggest that personalized OVR provides benefits over generic OVR for some talkers only. Our results also indicate that performance comparisons between systems are not always accurately predicted by objective metrics. In particular, certain disturbances lead to a consistent overestimation of quality compared to actual subjective ratings.
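Comparing predicted with subjectively measured quality is commonly done with correlation coefficients. The sketch below assumes one objective score and one mean MUSHRA rating per processing condition. It computes Pearson and Spearman correlation and, as an illustrative measure of whether comparisons between systems are predicted correctly, the fraction of condition pairs that the metric orders the same way as the listeners. The function names are illustrative, and the paper's exact analysis procedure may differ.

```python
import math
from itertools import combinations
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def ranks(x: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def pairwise_agreement(metric: Sequence[float], mushra: Sequence[float]) -> float:
    """Fraction of condition pairs ordered the same way by metric and listeners.

    Pairs with equal MUSHRA means are skipped; a metric tie on a pair that
    listeners distinguish counts as a disagreement.
    """
    agree = total = 0
    for i, j in combinations(range(len(metric)), 2):
        d_sub = mushra[j] - mushra[i]
        if d_sub == 0:
            continue
        total += 1
        if (metric[j] - metric[i]) * d_sub > 0:
            agree += 1
    if total == 0:
        raise ValueError("no condition pairs with distinct MUSHRA ratings")
    return agree / total
```

A single correlation coefficient can be high even when a metric misorders systems of similar quality, so the pairwise agreement speaks more directly to whether system comparisons are predicted correctly.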

Paper Structure

This paper contains 28 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Block diagram of own voice reconstruction using an outer and an in-ear microphone of a hearable.
  • Figure 2: (a) ESTOI improvement achieved by OVR systems trained with different personalization strategies in data augmentation (DA) and fine-tuning (FT) and (b) individual difference in ESTOI improvement between the conditions (generic DA, generic FT) and (generic DA, personalized FT). Each data point denotes the average over one target talker's test set. The data points labeled 0 and 6 correspond to the talkers selected for the listening experiment as the low-predicted-benefit and high-predicted-benefit cases, respectively.
  • Figure 3: (a) PESQ improvement achieved by OVR systems trained with different personalization strategies in data augmentation (DA) and fine-tuning (FT) and (b) individual difference in PESQ improvement between the conditions (generic DA, generic FT) and (generic DA, personalized FT). Each data point denotes the average over one target talker's test set. The data points labeled 0 and 6 correspond to the talkers selected for the listening experiment as the low-predicted-benefit and high-predicted-benefit cases, respectively.
  • Figure 4: Spectrograms of the recorded noise signals used in the subjective evaluation.
  • Figure 5: Subjective MUSHRA quality ratings (averaged over sentences) for speech in the low predicted benefit case.
  • ...and 2 more figures
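The per-talker quantity in panel (b) of Figures 2 and 3 can be sketched as follows, assuming per-utterance metric scores (ESTOI or PESQ) are available for the unprocessed signals and for both fine-tuning conditions, which share generic DA as in the captions. Improvement is taken as processed minus unprocessed score, averaged over each talker's test set. Choosing the low- and high-predicted-benefit talkers as the minimum and maximum of this difference is a hypothetical simplification, since the captions do not state the selection rule. Talker IDs and scores below are made up.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical layout: talker id -> per-utterance metric scores (e.g. ESTOI).
Scores = Dict[str, List[float]]


def mean_improvement(processed: Scores, unprocessed: Scores) -> Dict[str, float]:
    """Metric improvement (processed - unprocessed) averaged per talker."""
    return {
        talker: mean(p - u for p, u in zip(processed[talker], unprocessed[talker]))
        for talker in processed
    }


def personalization_benefit(
    generic_ft: Scores, personalized_ft: Scores, unprocessed: Scores
) -> Dict[str, float]:
    """Per-talker difference in improvement: personalized FT minus generic FT."""
    gen = mean_improvement(generic_ft, unprocessed)
    pers = mean_improvement(personalized_ft, unprocessed)
    return {talker: pers[talker] - gen[talker] for talker in gen}


def pick_low_high(benefit: Dict[str, float]) -> Tuple[str, str]:
    """Assumed rule: talkers with the smallest and largest predicted benefit."""
    return min(benefit, key=benefit.get), max(benefit, key=benefit.get)


# Made-up scores for two talkers, two utterances each (illustration only).
unprocessed = {"talker_a": [0.50, 0.50], "talker_b": [0.40, 0.60]}
generic_ft = {"talker_a": [0.60, 0.60], "talker_b": [0.50, 0.70]}
personalized_ft = {"talker_a": [0.60, 0.62], "talker_b": [0.60, 0.80]}

benefit = personalization_benefit(generic_ft, personalized_ft, unprocessed)
low, high = pick_low_high(benefit)  # ("talker_a", "talker_b") for these toy values
```

Averaging per talker before taking the difference mirrors the captions, where each data point summarizes one target talker's test set rather than individual utterances.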