When Less Is More? Diagnosing ASR Predictions in Sardinian via Layer-Wise Decoding

Domenico De Cristofaro, Alessandro Vietti, Marianne Pouplier, Aleese Block

TL;DR

The paper applies layer-wise decoding to a pretrained multilingual Wav2Vec2 ASR model to study phoneme-level representations in Campidanese Sardinian, a low-resource language. By truncating the upper transformer layers and decoding from intermediate layers, the authors show that an intermediate layer, two below the final one, yields a lower Phoneme Error Rate (PER) than the full model, challenging the assumption that greater depth always improves phoneme accuracy. They introduce the notion of regressive errors, cases where deeper layers overwrite correct intermediate predictions, and report both quantitative PER trends and qualitative analyses of specific utterances. The work highlights the value of layer-wise probing as a diagnostic tool for ASR in low-resource settings and suggests that standard error metrics may obscure meaningful representational dynamics across layers.

Abstract

Recent studies have shown that intermediate layers in multilingual speech models often encode more phonetically accurate representations than the final output layer. In this work, we apply a layer-wise decoding strategy to a pretrained Wav2Vec2 model to investigate how phoneme-level predictions evolve across encoder layers, focusing on Campidanese Sardinian, a low-resource language. We show that truncating upper transformer layers leads to improved Phoneme Error Rates (PER), with the best performance achieved not at the final layer, but two layers earlier. Through fine-grained alignment analysis, we find that intermediate predictions better preserve segmental identity, avoid overgeneration, and reduce certain classes of phonological errors. We also introduce the notion of regressive errors, cases where correct predictions at intermediate layers are overwritten by errors at the final layer. These regressions highlight the limitations of surface-level error metrics and reveal how deeper layers may generalize or abstract away from acoustic detail. Our findings support the use of early-layer probing as a diagnostic tool for ASR models, particularly in low-resource settings where standard evaluation metrics may fail to capture linguistically meaningful behavior.
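To make the comparison across truncation depths concrete, the following is a minimal sketch of how PER could be computed for each number of removed layers and how the best setting could be selected. It assumes PER is the total phoneme-level edit distance divided by the total number of reference phonemes, which is the usual definition but is not spelled out in the abstract. The names `edit_distance`, `per`, and `hyps_by_removed`, as well as the toy phoneme strings, are invented for illustration and are not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference phoneme
                            curr[j - 1] + 1,      # insertion of a predicted phoneme
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def per(refs, hyps):
    """Corpus-level PER: total edits over total reference phonemes."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_ref = sum(len(r) for r in refs)
    return total_edits / total_ref


# Hypothetical toy data: one reference utterance and the decoded output
# obtained after removing 0, 2, or 4 upper transformer layers.
refs = [["s", "a", "k", "a", "z", "a"]]
hyps_by_removed = {
    0: [["s", "a", "k", "a", "s", "a"]],  # full model: one substitution
    2: [["s", "a", "k", "a", "z", "a"]],  # two layers removed: exact match
    4: [["s", "k", "a", "z", "a"]],       # four layers removed: one deletion
}

per_by_removed = {k: per(refs, hyps) for k, hyps in hyps_by_removed.items()}
best_removed = min(per_by_removed, key=per_by_removed.get)
print(per_by_removed, best_removed)
```

In this toy setup the lowest PER occurs with two layers removed, mirroring the pattern the abstract reports; the numbers themselves carry no empirical meaning.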

Paper Structure

This paper contains 12 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Heatmaps of the most frequent phoneme deletions and substitutions (ref $\rightarrow$ pred) as the number of removed transformer layers increases.
  • Figure 2: Trends of phoneme-level error types across effective encoder layers. While deletion errors decrease as more layers are retained, substitution errors increase. Although the number of correctly predicted phonemes (hits) also increases, it is possible that many previously deleted segments are now realized as incorrect substitutions.
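The error-type breakdown described in the Figure 2 caption can be illustrated with a small alignment routine. This is a sketch under the assumption that hits, substitutions, deletions, and insertions are read off a standard minimum-edit-distance alignment; the paper's actual alignment tool is not named in this section. The function `align_counts` and the toy sequences are hypothetical.

```python
from collections import Counter


def align_counts(ref, hyp):
    """Count hits, substitutions, deletions, and insertions along one
    minimum-edit-distance alignment of ref and hyp."""
    n, m = len(ref), len(hyp)
    # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,
                           dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1)

    # Backtrace from the bottom-right corner, preferring the diagonal move.
    counts = Counter()
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if same else 1):
            counts["hit" if same else "substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts["deletion"] += 1
            i -= 1
        else:
            counts["insertion"] += 1
            j -= 1
    return counts


# Hypothetical toy case: a shallower decode drops a segment, while a deeper
# decode realizes the same segment as a wrong phoneme.
ref = ["a", "b", "c", "d"]
shallow = align_counts(ref, ["a", "b", "d"])
deep = align_counts(ref, ["a", "b", "x", "d"])
print(dict(shallow), dict(deep))
```

In the toy case both decodes have the same number of hits, but the deletion in the shallower output becomes a substitution in the deeper one, which is the kind of shift the caption suggests may lie behind the opposing deletion and substitution trends.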