Optimizing Automatic Speech Assessment: W-RankSim Regularization and Hybrid Feature Fusion Strategies

Chung-Wen Wu, Berlin Chen

TL;DR

This work tackles data imbalance in Automatic Speech Assessment (ASA) by framing ASA as an imbalanced ordinal classification task and introducing Weighted Vectors Ranking Similarity (W-RankSim) as a regularizer. W-RankSim aligns the geometry of the output-layer weight vectors with the ordinal structure of the class labels. It is added to the primary loss as $L_{\text{total}} = L_{\text{main}} + \gamma L_{\text{W-RankSim}}$, where $\gamma$ weights the regularizer, and, unlike RankSim, it does not require very large batches. The authors also propose a hybrid model that fuses self-supervised features (e.g., Whisper, wav2vec 2.0) with handcrafted features across content, delivery, and language-use components. Experiments on the GEPT corpus show consistent gains from W-RankSim, with the LMCL + W-RankSim hybrid achieving the strongest performance across batch sizes, which indicates robustness to data imbalance and batch-size constraints. Overall, the approach advances ASA by leveraging ordinal information and feature fusion, with potential applicability to other imbalanced ordinal tasks.

Abstract

Automatic Speech Assessment (ASA) has seen notable advancements through the use of self-supervised learning (SSL) features in recent research. However, a key challenge in ASA lies in the imbalanced distribution of data, which is particularly evident in English test datasets. To address this challenge, we approach ASA as an ordinal classification task and introduce Weighted Vectors Ranking Similarity (W-RankSim) as a novel regularization technique. W-RankSim encourages the weighted vectors of similar classes in the output layer to lie closer together, so that feature vectors with similar labels are gradually nudged closer to each other as they converge towards their corresponding weighted vectors. Extensive experimental evaluations confirm the effectiveness of our approach in improving ordinal classification performance for ASA. Furthermore, we propose a hybrid model that combines SSL and handcrafted features, showing how the inclusion of handcrafted features enhances performance in an ASA system.
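
As a rough illustration of the idea described above, the following NumPy sketch scores how well the pairwise similarity ranking of the output-layer weight vectors matches the ranking implied by ordinal label distance, then adds that score to a main loss with weight $\gamma$. The function names, the use of cosine similarity, the mean squared rank difference, and the value of gamma are assumptions made for illustration and are not taken from the paper. The ranking here is also non-differentiable, so a trainable version would need a differentiable ranking step that this sketch leaves out.

```python
import numpy as np


def average_ranks(m):
    """Rank the entries within each row (0 = smallest); ties share their average rank."""
    less = (m[:, None, :] < m[:, :, None]).sum(axis=-1)
    equal = (m[:, None, :] == m[:, :, None]).sum(axis=-1)
    return less + (equal - 1) / 2.0


def w_ranksim_penalty(weights):
    """Hypothetical W-RankSim-style penalty (illustrative, not the paper's exact loss).

    weights: array of shape (num_classes, dim); row k is the output-layer
    weight vector of ordinal class k.
    """
    num_classes = weights.shape[0]
    labels = np.arange(num_classes)
    # Label similarity: classes that are closer in score are more similar.
    label_sim = -np.abs(labels[:, None] - labels[None, :])
    # Cosine similarity between weight vectors (assumed similarity measure).
    unit = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    # Round so that numerically near-identical similarities count as ties.
    weight_sim = np.round(unit @ unit.T, 8)
    # Compare, row by row, the rank of each class under both similarity measures.
    diff = average_ranks(label_sim) - average_ranks(weight_sim)
    return float(np.mean(diff ** 2))


def total_loss(main_loss, weights, gamma=1.0):
    """L_total = L_main + gamma * L_W-RankSim; the default gamma is an assumed value."""
    return main_loss + gamma * w_ranksim_penalty(weights)


if __name__ == "__main__":
    # Five ordinal classes whose weight vectors fan out in score order.
    angles = 0.3 * np.arange(5)
    ordered = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # The same vectors assigned to classes in a scrambled order.
    shuffled = ordered[[0, 3, 1, 4, 2]]
    print("ordered penalty: ", w_ranksim_penalty(ordered))
    print("shuffled penalty:", w_ranksim_penalty(shuffled))
    print("total loss:      ", total_loss(1.0, shuffled, gamma=0.5))
```

Because the penalty depends only on the weight matrix of the output layer, it can be evaluated without reference to the samples in a batch, which is the property the summary contrasts with RankSim.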

Paper Structure

This paper contains 15 sections, 7 equations, 4 figures, and 1 table.

Figures (4)

  • Figure 1: Illustration of our proposed regularization W-RankSim vs. RankSim. RankSim requires the last feature embeddings and labels in a batch, whereas W-RankSim leverages weighted vectors in the output layer to achieve a similar effect without being constrained by batch size.
  • Figure 2: The architecture of the Hybrid ASA model comprises three parts: (a) content, (b) delivery, and (c) language use, each addressing a specific aspect of speech assessment. The first part utilizes SSL features generated by a pretrained acoustic model such as Whisper. The remaining parts leverage handcrafted features to capture relevant characteristics for speech assessment.
  • Figure 3: Response length distribution (left) and score distribution (right) in the GEPT corpus.
  • Figure 4: Results for a hybrid model that combines LMCL with different regularization methods across different batch sizes.