NLP and Education: using semantic similarity to evaluate filled gaps in a large-scale Cloze test in the classroom

Túlio Sousa de Gois; Flávia Oliveira Freitas; Julian Tejada; Raquel Meister Ko. Freitag

TL;DR

This work addresses the difficulty of scaling Cloze tests in classroom settings by proposing an automated correction approach based on semantic similarity computed from word embeddings (WE) for Brazilian Portuguese. It compares three WE models (GloVe, Wang2Vec, and spaCy pt_core_news_lg), using cosine similarity to compare student answers with the expected responses, and validates the resulting scores against human judges. GloVe shows the strongest agreement with the human judgments, although the model rankings do not differ significantly, which supports the viability of semantic-similarity-based automatic scoring for large-scale Cloze assessments. The study offers a practical, corpus-based method for speeding up the evaluation of reading proficiency and points to future work on deep-learning embeddings to further improve performance.

Abstract

This study examines the applicability of the Cloze test, a widely used tool for assessing text comprehension proficiency, while highlighting the challenges of implementing it at large scale. To address these limitations, an automated correction approach is proposed that uses Natural Language Processing (NLP) techniques, particularly word embedding (WE) models, to assess the semantic similarity between expected and provided answers. Using data from Cloze tests administered to students in Brazil, WE models for Brazilian Portuguese (PT-BR) were employed to measure the semantic similarity of the responses. The results were validated through an experimental setup involving twelve judges who classified the students' answers. A comparative analysis between the WE models' scores and the judges' evaluations revealed that GloVe was the most effective model, demonstrating the highest correlation with the judges' assessments. This study underscores the utility of WE models in evaluating semantic similarity and their potential to enhance large-scale Cloze test assessments. Furthermore, it contributes to educational assessment methodologies by offering a more efficient approach to evaluating reading proficiency.
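
The scoring idea described in the abstract, comparing the expected word with the word a student supplied by the cosine similarity of their embeddings, can be illustrated with a minimal sketch. This is not the authors' implementation: the three-dimensional vectors in `TOY_EMBEDDINGS` are hypothetical stand-ins for real PT-BR embeddings (such as GloVe or Wang2Vec vectors), the function names are invented for illustration, and treating an exact match as a score of 1.0 and an out-of-vocabulary word as `None` are assumptions of this sketch.

```python
import math

# Hypothetical toy vectors standing in for real PT-BR word embeddings.
# Real models use hundreds of dimensions; these values are illustrative only.
TOY_EMBEDDINGS = {
    "casa": [0.9, 0.1, 0.2],   # "house"
    "lar": [0.8, 0.2, 0.3],    # "home"
    "carro": [0.1, 0.9, 0.4],  # "car"
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)


def score_gap(expected, answer, embeddings=TOY_EMBEDDINGS):
    """Score one filled gap against the expected word.

    Assumptions of this sketch: an exact match scores 1.0, and a word
    missing from the vocabulary yields None so it can be reviewed by hand.
    """
    expected = expected.strip().lower()
    answer = answer.strip().lower()
    if expected == answer:
        return 1.0
    if expected not in embeddings or answer not in embeddings:
        return None
    return cosine_similarity(embeddings[expected], embeddings[answer])


if __name__ == "__main__":
    for answer in ("casa", "lar", "carro", "xyz"):
        print(answer, score_gap("casa", answer))
```

With real embeddings, the per-gap scores produced this way could then be compared with the judges' classifications, as the study does when ranking the three models.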
