Multi-Channel MOSRA: Mean Opinion Score and Room Acoustics Estimation Using Simulated Data and a Teacher Model

Jozef Coldenhoff, Andrew Harper, Paul Kendrick, Tijana Stojkovic, Milos Cernak

TL;DR

This work addresses the joint prediction of speech quality (MOS) and room acoustic parameters for multi-channel recordings, with the goal of informing device selection in multi-microphone environments. It extends MOSRA with a five-channel feature extractor and per-channel prediction heads. The model is trained on a large synthetic dataset that combines room impulse responses, speech, and noise, with MOS labels supplied by a teacher model. Compared with a single-channel baseline, the multi-channel model predicts the acoustic descriptors STI, DRR, and C50 more accurately at roughly 5× lower computation per channel. The benefit of multi-channel training for MOS prediction is mixed, due to distribution gaps between simulated and real data. The findings suggest that training on simulated data can generalize to real datasets that are in-distribution, but that distribution shifts across datasets limit MOS robustness. This points to future work on richer degradations and more diverse real-world validation to enable reliable, real-time, multi-channel quality-based device selection.

Abstract

Previous methods for predicting room acoustic parameters and speech quality metrics have focused on the single-channel case, where room acoustics and Mean Opinion Score (MOS) are predicted for a single recording device. However, quality-based device selection for rooms with multiple recording devices may benefit from a multi-channel approach where the descriptive metrics are predicted for multiple devices in parallel. Following our hypothesis that a model may benefit from multi-channel training, we develop a multi-channel model for joint MOS and room acoustics prediction (MOSRA) for five channels in parallel. The lack of multi-channel audio data with ground truth labels necessitated the creation of simulated data using an acoustic simulator with room acoustic labels extracted from the generated impulse responses and labels for MOS generated in a student-teacher setup using a wav2vec2-based MOS prediction model. Our experiments show that the multi-channel model improves the prediction of the direct-to-reverberation ratio, clarity, and speech transmission index over the single-channel model with roughly 5$\times$ less computation while suffering minimal losses in the performance of the other metrics.
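
The abstract states that room acoustic labels are extracted from the generated impulse responses, but this page does not give the extraction procedure. As a rough illustration only, the sketch below computes clarity (C50) using the standard early-to-late energy ratio with a 50 ms boundary. The function name `clarity_c50`, the choice of the largest-magnitude sample as the direct-sound onset, and the plain-list input are assumptions made for this example, not the authors' implementation.

```python
import math


def clarity_c50(rir, sample_rate, early_ms=50.0):
    """Clarity index in dB: ratio of early to late energy in an impulse response.

    Assumption: time zero is the direct-sound arrival, approximated here by
    the sample with the largest absolute amplitude.
    """
    if not rir or all(x == 0.0 for x in rir):
        raise ValueError("impulse response must contain non-zero samples")

    # Locate the direct sound and the early/late boundary after it.
    onset = max(range(len(rir)), key=lambda i: abs(rir[i]))
    split = onset + int(round(early_ms * 1e-3 * sample_rate))

    early = sum(x * x for x in rir[onset:split])
    late = sum(x * x for x in rir[split:])
    if late == 0.0:
        return math.inf  # no energy after the boundary
    return 10.0 * math.log10(early / late)
```

For example, an impulse response with a unit direct path and a single late reflection of amplitude 0.1 has 100 times more early energy than late energy, which corresponds to a C50 of about 20 dB.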

Paper Structure (14 sections, 2 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Overview of the proposed framework. On the left, a high-level overview of the data generation process is given. On the right, the details of the model architecture are shown, with the number of parameters shown in the red boxes.
  • Figure 2: T60 and DRR versus distance to the active speaker.
  • Figure 3: Multi-channel MOSRA predictions using a circular buffer of roughly 4 seconds of audio. The audio is recorded in a real room where the speaker is crossfaded between recording devices placed in three spatially disjoint locations. The model makes predictions on three channels, with the first two repeated to fill the five inputs, i.e., the input to the model is channels [1, 2, 3, 1, 2]. Note that the scores are standardized across channels for each time step to aid interpretability.
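
The inference setup described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The helper names (`make_buffer`, `pad_channels`, `standardize_across_channels`) are hypothetical, and so are the 16 kHz sample rate and the use of the population standard deviation. The caption also does not say whether standardization is applied to the three real channels or to all five model outputs.

```python
from collections import deque
from statistics import mean, pstdev


def make_buffer(sample_rate=16000, seconds=4.0):
    """Circular buffer that keeps only the most recent `seconds` of samples."""
    return deque(maxlen=int(sample_rate * seconds))


def pad_channels(channel_ids, target=5):
    """Repeat the available channels cyclically until `target` inputs exist."""
    if not channel_ids:
        raise ValueError("need at least one channel")
    return [channel_ids[i % len(channel_ids)] for i in range(target)]


def standardize_across_channels(scores):
    """Z-score the per-channel scores of one time step."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0.0:
        # All channels scored identically, so no channel stands out.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


# Three physical devices mapped onto a five-input model.
model_inputs = pad_channels([1, 2, 3])

# Hypothetical per-channel MOS predictions for one time step.
relative = standardize_across_channels([3.0, 4.0, 5.0])
```

Standardizing per time step removes the shared level of the scores, so the plot shows which device is best relative to the others at each moment rather than the absolute predicted values.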