Examining the Robustness of Large Language Models across Language Complexity
Jiayi Zhang
TL;DR
This study examines how robust LLM-based detectors are when processing student textual artifacts that vary in language complexity. It leverages CueThink Thinklets, defines self-regulated learning (SRL) constructs within the Winne SMART framework, and uses OpenAI embeddings with a simple neural net to detect SRL indicators. The authors evaluate robustness across lexical, syntactic, and semantic dimensions using Mass, syntactic simplicity, and deep cohesion. The results show strong overall performance with construct-specific sensitivity to language features, highlighting implications for fairness and deployment of LLM-assisted educational tools.
Abstract
With the advancement of large language models (LLMs), an increasing number of student models have leveraged LLMs to analyze textual artifacts generated by students in order to understand and evaluate their learning. These student models typically employ pre-trained LLMs to vectorize text inputs into embeddings and then use the embeddings to train models that detect the presence or absence of a construct of interest. However, how reliable and robust are these models when processing language of differing complexity? In learning contexts, where students may have different language backgrounds and varying levels of writing skill, it is critical to examine the robustness of such models to ensure that they work equally well for text with varying levels of language complexity. A limited number of studies show that language use can indeed affect the performance of LLMs. As such, in the current study, we examined the robustness of several LLM-based student models that detect student self-regulated learning (SRL) in math problem-solving. Specifically, we compared how the performance of these models varies between texts with high and low lexical, syntactic, and semantic complexity, as measured by three linguistic measures.
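
The high-versus-low comparison described above can be illustrated with a minimal sketch. It is not the study's implementation: the median split, the choice of AUC as the performance metric, the function names `auc` and `auc_by_complexity`, and the toy data are all assumptions made for illustration. In practice, the scores would come from a detector trained on embeddings and the complexity values from a linguistic measure.

```python
from statistics import median


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_complexity(labels, scores, complexity):
    """Split examples at the median complexity (assumed cutoff) and score each group."""
    cut = median(complexity)
    groups = {"low": ([], []), "high": ([], [])}
    for y, s, c in zip(labels, scores, complexity):
        key = "high" if c > cut else "low"
        groups[key][0].append(y)
        groups[key][1].append(s)
    return {key: auc(ys, ss) for key, (ys, ss) in groups.items()}


# Hypothetical toy data: construct labels, detector scores, one complexity value per text.
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.8, 0.3, 0.6, 0.7, 0.4, 0.5]
complexity = [1, 2, 3, 4, 5, 6, 7, 8]

result = auc_by_complexity(labels, scores, complexity)
print(result)
```

A large gap between the two group scores would suggest that the detector is sensitive to that dimension of language complexity; repeating the comparison per construct and per measure gives the kind of breakdown the abstract describes.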
