Late fusion ensembles for speech recognition on diverse input audio representations
Marin Jezidžić, Matej Mihelčić
TL;DR
This work investigates late fusion ensembles for automatic speech recognition by training multiple E-Branchformer models on diverse input audio representations and fusing their outputs at score level. A generalized decoding framework combines per-model CTC and attention-based scores across representations using weighted sums, enabling parallel training and flexible integration with language models. Empirically, the approach yields 1–14% relative improvements over state-of-the-art baselines across LibriSpeech, Aishell, GigaSpeech, and TEDLIUMv2, with notable gains on certain targets (e.g., CER improvements on Aishell). The results demonstrate that representation diversity yields complementary information even for powerful models, and that more sophisticated fusion strategies could unlock further gains.
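The weighted-sum fusion described above can be illustrated with a minimal sketch. It assumes each model has already assigned a CTC log-score and an attention-decoder log-score to every hypothesis in a shared candidate list. The function name `fuse_scores`, the weight parameters, and the toy numbers are hypothetical and are not taken from the paper. Rescoring a fixed candidate list is also a simplification and may differ from how the authors integrate fusion into decoding.

```python
from typing import Dict, List, Optional


def fuse_scores(
    ctc_scores: List[Dict[str, float]],
    att_scores: List[Dict[str, float]],
    ctc_weights: List[float],
    att_weights: List[float],
    lm_scores: Optional[Dict[str, float]] = None,
    lm_weight: float = 0.0,
) -> Dict[str, float]:
    """Return a weighted sum of per-model log-scores for each hypothesis.

    Each entry of ctc_scores / att_scores maps hypothesis text to the
    log-score given by one model (one model per input representation).
    """
    n_models = len(ctc_scores)
    if not (len(att_scores) == len(ctc_weights) == len(att_weights) == n_models):
        raise ValueError("expected one score table and one weight pair per model")

    # Keep only hypotheses scored by every model (and by the LM, if given).
    common = set(ctc_scores[0])
    for table in ctc_scores[1:] + att_scores:
        common &= set(table)
    if lm_scores is not None:
        common &= set(lm_scores)

    fused: Dict[str, float] = {}
    for hyp in common:
        total = 0.0
        for m in range(n_models):
            total += ctc_weights[m] * ctc_scores[m][hyp]
            total += att_weights[m] * att_scores[m][hyp]
        if lm_scores is not None:
            total += lm_weight * lm_scores[hyp]
        fused[hyp] = total
    return fused


# Toy example: two models (hypothetically trained on different audio
# representations) score the same two candidate transcripts.
ctc = [{"a cat": -1.0, "a cap": -2.0}, {"a cat": -3.0, "a cap": -1.0}]
att = [{"a cat": -1.0, "a cap": -2.0}, {"a cat": -1.0, "a cap": -2.0}]
fused = fuse_scores(ctc, att, ctc_weights=[0.25, 0.25], att_weights=[0.25, 0.25])
best = max(fused, key=fused.get)
```

In this toy case the second model's CTC branch prefers "a cap", but the fused score still selects "a cat" because the remaining three score streams agree on it. Because each model is scored independently, the per-model score tables could be produced in parallel before fusion.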
Abstract
We explore diverse representations of speech audio and their effect on the performance of a late fusion ensemble of E-Branchformer models applied to the Automatic Speech Recognition (ASR) task. Although ensemble methods are known to often improve system performance, including in speech recognition, it is worth exploring how ensembles of complex state-of-the-art models, such as medium-sized and large E-Branchformers, behave when their base models are trained on diverse representations of the input speech audio. We evaluate on four widely used benchmark datasets, \textit{LibriSpeech}, \textit{Aishell}, \textit{GigaSpeech}, and \textit{TEDLIUMv2}, and show that relative improvements of $1\%$--$14\%$ can still be achieved over state-of-the-art models trained using comparable techniques on these datasets. Notably, such an ensemble yields improvements even when language models are used, although the gap narrows.
