An Interpretable and Crosslingual Method for Evaluating Second-Language Dialogues

Rena Gao, Jingxuan Wu, Xuetong Wu, Carsten Roever, Jing Wu, Long Lv, Jey Han Lau

TL;DR

This study tests whether a dialogue evaluation framework designed for English-as-a-second-language (ESL) dialogues transfers to Chinese-as-a-second-language (CSL) conversations. To do so, it introduces CNIMA, a large annotated CSL dataset with 10K+ dialogues. The analysis finds both language-universal and language-specific links between micro-level linguistic cues and macro-level interactivity. The study also proposes an automated, interpretable pipeline that predicts micro-level features, macro-level labels, and overall dialogue quality without requiring labelled data. The approach uses both classical models and large language models (BERT, GPT-4o) to deliver interpretable scores (F1 > 0.80), and it is robust across the two languages, with potential to generalize to other languages and datasets. By exposing the features that drive quality, the work supports scalable, transparent second-language dialogue assessment that can be adapted across languages.

Abstract

We analyse the cross-lingual transferability of a dialogue evaluation framework, originally designed for English-as-a-second-language dialogues, that assesses the relationships between micro-level linguistic features (e.g. backchannels) and macro-level interactivity labels (e.g. topic management). To this end, we develop CNIMA (Chinese Non-Native Interactivity Measurement and Automation), a labelled Chinese-as-a-second-language dataset with 10K dialogues. We find the evaluation framework to be robust across two distinct languages, English and Chinese, and it reveals both language-specific and language-universal relationships between micro-level and macro-level features. Next, we propose an automated, interpretable approach with low data requirements that scores the overall quality of a second-language dialogue based on the framework. Our approach is interpretable in that it reveals the key linguistic and interactivity features that contribute to the overall quality score. As our approach does not require labelled data, it can also be adapted to other languages for second-language dialogue evaluation.
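
To make the three-level structure concrete (micro-level features, macro-level interactivity labels, overall quality), here is a minimal Python sketch. It is not the authors' implementation. The keyword-based backchannel detector, the token list, the linear weights, and all function names are hypothetical stand-ins, chosen only to show how a score can be traced back to the features that produced it. Only the feature name "backchannels" and the label "topic management" come from the abstract.

```python
from typing import Dict, List, Tuple

# Illustrative backchannel tokens (assumption, not taken from the paper).
BACKCHANNEL_TOKENS = {"嗯", "对", "哦"}

# Hypothetical linear weights linking each level to the next.
MACRO_WEIGHTS: Dict[str, Dict[str, float]] = {
    "topic_management": {"backchannels": 1.0},
}
OVERALL_WEIGHTS: Dict[str, float] = {"topic_management": 1.0}


def micro_features(turns: List[str]) -> Dict[str, float]:
    """Step 1 (stand-in): rate of turns that consist only of a backchannel token."""
    n = max(len(turns), 1)
    hits = sum(1 for t in turns if t.strip() in BACKCHANNEL_TOKENS)
    return {"backchannels": hits / n}


def macro_labels(micro: Dict[str, float]) -> Dict[str, float]:
    """Step 2 (stand-in): weighted sum of micro-level features per macro label."""
    return {
        label: sum(w * micro.get(feat, 0.0) for feat, w in weights.items())
        for label, weights in MACRO_WEIGHTS.items()
    }


def overall_score(macro: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Step 3 (stand-in): overall score plus the per-label contributions behind it."""
    contributions = {label: OVERALL_WEIGHTS[label] * v for label, v in macro.items()}
    return sum(contributions.values()), contributions


def score_dialogue(turns: List[str]) -> Dict[str, object]:
    """Run all three steps and keep the intermediate values for inspection."""
    micro = micro_features(turns)
    macro = macro_labels(micro)
    score, contributions = overall_score(macro)
    return {"micro": micro, "macro": macro, "score": score,
            "contributions": contributions}


if __name__ == "__main__":
    print(score_dialogue(["你好", "嗯", "今天天气很好", "对"]))
```

Because every intermediate value is returned, a reader can see which micro-level feature and which macro-level label contributed to the final number. In a real system each step would be replaced by a trained model or a prompted language model.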

Paper Structure

This paper contains 28 sections, 2 equations, 4 figures, and 16 tables.

Figures (4)

  • Figure 1: An example of a CSL dialogue annotated with the micro-level features, macro-level interactivity labels and overall dialogue quality score.
  • Figure 2: Three-step pipeline for automated scoring of CSL dialogues.
  • Figure 3: Annotation tool demo.
  • Figure 4: Hierarchical label assignment demo.