Automated Speaking Assessment of Conversation Tests with Novel Graph-based Modeling on Spoken Response Coherence

Jiun-Ting Li, Bi-Cheng Yan, Tien-Hong Lo, Yi-Cheng Wang, Yung-Chang Hsu, Berlin Chen

TL;DR

Problem: automated speaking assessment of conversation tests must account for coherence across turns to accurately judge L2 proficiency. Approach: a hierarchical graph modeling framework (EHGM) couples a contextual LM with multi-level graphs that encode semantically related words, intra-response SPO actions, and inter-response discourse, with fusion at the regressor stage to predict $\hat{Y}$. Contributions: (1) enhanced hierarchical graph modeling of coherence, (2) an integration strategy for bringing hierarchical context into holistic scoring, and (3) publicly available code and preprocessing. Findings: on the NICT-JLE benchmark, the proposed method yields substantial improvements over strong baselines in RMSE, PCC, and margin-accuracy, highlighting coherence-aware representations as key for accurate ASAC. Significance: enables more reliable, interpretable automatic assessment of spoken proficiency in conversational settings and informs future coherence-aware language assessment research.

Abstract

Automated speaking assessment in conversation tests (ASAC) aims to evaluate the overall speaking proficiency of an L2 (second-language) speaker in a setting where an interlocutor interacts with one or more candidates. Although prior ASAC approaches have shown promising performance on their respective datasets, there is still a dearth of research specifically focused on incorporating the coherence of the logical flow within a conversation into the grading model. To address this critical challenge, we propose a hierarchical graph model that aptly incorporates both broad inter-response interactions (e.g., discourse relations) and nuanced semantic information (e.g., semantic words and speaker intents), which is subsequently fused with contextual information for the final prediction. Extensive experimental results on the NICT-JLE benchmark dataset suggest that our proposed modeling approach can yield considerable improvements in prediction accuracy with respect to various assessment metrics, as compared to some strong baselines. This also sheds light on the importance of investigating coherence-related facets of spoken responses in ASAC.
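The summary above names RMSE, PCC, and margin-accuracy as the assessment metrics without defining them. The following is a minimal sketch of how such metrics are commonly computed for a score-regression task. The function names are hypothetical, and the definition of margin-accuracy used here (the fraction of predictions within a fixed distance of the target score) is an assumption, not necessarily the paper's exact formulation.

```python
import math


def rmse(targets, preds):
    """Root-mean-square error between target and predicted scores."""
    squared = [(t - p) ** 2 for t, p in zip(targets, preds)]
    return math.sqrt(sum(squared) / len(targets))


def pcc(targets, preds):
    """Pearson correlation coefficient between targets and predictions."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_p = sum(preds) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(targets, preds))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_p = sum((p - mean_p) ** 2 for p in preds)
    return cov / math.sqrt(var_t * var_p)


def margin_accuracy(targets, preds, margin):
    """ASSUMED definition: share of predictions with |target - pred| <= margin."""
    hits = sum(1 for t, p in zip(targets, preds) if abs(t - p) <= margin)
    return hits / len(targets)


# Toy scores on an arbitrary numeric proficiency scale (illustrative only).
toy_targets = [1.0, 2.0, 3.0, 4.0]
toy_preds = [1.0, 2.0, 3.0, 6.0]
toy_rmse = rmse(toy_targets, toy_preds)
toy_pcc = pcc(toy_targets, toy_preds)
toy_margin_acc = margin_accuracy(toy_targets, toy_preds, margin=0.5)
```

Lower RMSE is better, while higher PCC and margin-accuracy are better, so the three metrics together capture absolute error, rank agreement, and tolerance-based correctness.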

Paper Structure (17 sections, 3 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: An illustration of a partial conversation test sample from the NICT JLE dataset (Izumi et al., 2004). Each sentence is one response in the conversation. The conversation test is held on specific topics, with an interlocutor and a candidate taking part.
  • Figure 2: The ASAC grading model framework. From bottom to top: (1) Two processing stages for the conversational spoken content, concatenating and splitting, prepare the model inputs. (2) The left module is a contextualized encoder for the sequential input; the right module models hierarchical context within and across responses, with information propagating from the bottom level to the top. This hierarchical context is how the model captures coherence. At the response level (intra-response), it aggregates semantic information from semantically related words and intents from the SPO tuples. The response information then propagates to the discourse level for inter-response interaction. (3) The graph-based representation $\mathbf{H}^{G}$ is fused with the mean-pooled embedding of $\mathbf{H}^{B}$, which is derived from the sequential model, to form the final decision $Y$.
  • Figure 3: Evaluation results of our proposed method (BERT+CDA). The confusion matrix shows, for each target CEFR level, the percentage of predictions that fall on each level.
  • Figure 4: Evaluation results of our proposed method (C+D+A).
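
The fusion step described in the Figure 2 caption can be sketched as follows. This is an illustrative sketch only: the caption says that $\mathbf{H}^{G}$ is fused with the mean-pooled embedding of $\mathbf{H}^{B}$ but does not say how, so concatenation followed by a linear regressor is an assumption here, and the names `mean_pool`, `fuse_and_score`, `weights`, and `bias` are hypothetical.

```python
def mean_pool(token_embeddings):
    """Average a list of equal-length token vectors into one vector."""
    count = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / count for i in range(dim)]


def fuse_and_score(h_b_tokens, h_g, weights, bias):
    """Fuse the pooled sequential embedding with the graph representation.

    ASSUMPTION: fusion is plain concatenation and the regressor is a single
    linear layer; the source only states that the two representations are fused.
    """
    fused = mean_pool(h_b_tokens) + list(h_g)  # list concatenation
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused vector length")
    return sum(w * x for w, x in zip(weights, fused)) + bias


# Toy inputs: two 2-dimensional token embeddings (H^B) and a
# 1-dimensional graph representation (H^G). Values are illustrative.
h_b = [[1.0, 3.0], [3.0, 5.0]]
h_g = [1.0]
score = fuse_and_score(h_b, h_g, weights=[0.5, 0.25, 1.0], bias=0.5)
```

In a trained model the weights and bias would be learned jointly with the encoder and graph modules; the fixed values above only make the arithmetic easy to follow.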