Deep Learning for Assessment of Oral Reading Fluency

Mithilesh Vaidya, Binaya Kumar Sahoo, Preeti Rao

TL;DR

The paper tackles automatic assessment of oral reading fluency using end-to-end wav2vec2.0-based architectures to predict expert comprehensibility ratings from children’s audio. It systematically compares Vanilla and Aligned architectures, evaluates multiple pre-trained models, and explores layer-weighting and representation probing to interpret learned features. Results show that frame-level wav2vec representations can surpass traditional hand-crafted features, with embeddings correlating to fluency aspects such as speech rate and prosodic cues; intermediate transformer layers often provide the most useful information. The work suggests a scalable path for fluency assessment leveraging large unlabeled datasets and motivates future fusion of complementary features to further boost performance.

Abstract

Reading fluency assessment is a critical component of literacy programmes, serving to guide and monitor early education interventions. Given the resource-intensive nature of the exercise when conducted by teachers, the development of automatic tools that can operate on audio recordings of oral reading is attractive as an objective and highly scalable solution. Multiple complex aspects such as accuracy, rate and expressiveness underlie human judgements of reading fluency. In this work, we investigate end-to-end modeling on a training dataset of children's audio recordings of story texts labeled by human experts. The pre-trained wav2vec2.0 model is adopted due to its potential to alleviate the challenges from the limited amount of labeled data. We report the performance of a number of system variations on the relevant measures, and also probe the learned embeddings for lexical and acoustic-prosodic features known to be important to the perception of reading fluency.

Paper Structure

This paper contains 13 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: W2Vanilla Architecture
  • Figure 2: W2VAligned Architecture
  • Figure 3: Location of probes in the Vanilla architecture. C is obtained by mean pooling the frame-level representations extracted from a pre-trained (frozen) wav2vec model. Passing C through 3 hidden layers with [128, 64, 4] hidden units gives a compressed representation B $\in \mathbb{R}^4$, from which the final score is regressed.
  • Figure 4: $P^f_c$ (performance on the wav2vec embedding), $P^f_b$ (performance on the bottleneck embedding), and their ratio, sorted in descending order of the ratio. The HC (hand-crafted) features are taken from kamini_thesis.
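
The data flow described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 768-dimensional frame size, the ReLU activations, the linear output head, and the random weights and input are all assumptions made here for the example. Only the mean pooling step and the [128, 64, 4] layer sizes come from the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 768                  # assumed encoder output size, not stated in the caption
HIDDEN_SIZES = [128, 64, 4]      # hidden units, as given in the Figure 3 caption


def init_layer(n_in, n_out):
    """Random weights standing in for trained parameters (illustration only)."""
    weights = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
    bias = np.zeros(n_out)
    return weights, bias


def forward(frames, hidden_layers, head):
    """Map frame-level features of shape (T, FRAME_DIM) to (C, B, score)."""
    c_vec = frames.mean(axis=0)              # C: mean pooling over the time axis
    h = c_vec
    for weights, bias in hidden_layers:
        h = np.maximum(h @ weights + bias, 0.0)   # ReLU is an assumption
    b_vec = h                                # B: 4-dimensional bottleneck
    w_out, b_out = head
    score = (b_vec @ w_out + b_out).item()   # scalar regression output
    return c_vec, b_vec, score


sizes = [FRAME_DIM] + HIDDEN_SIZES
hidden_layers = [init_layer(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
head = init_layer(HIDDEN_SIZES[-1], 1)

# Hypothetical input: 50 frames of random features in place of real encoder output.
frames = rng.normal(size=(50, FRAME_DIM))
c_vec, b_vec, score = forward(frames, hidden_layers, head)
print(c_vec.shape, b_vec.shape)
```

In this sketch, `c_vec` and `b_vec` correspond to the two probe locations named in the caption (C before the hidden layers, B after them), so a separate probe model could be fitted on either vector.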