G-SciEdBERT: A Contextualized LLM for Science Assessment Tasks in German
Ehsan Latif, Gyeong-Geon Lee, Knut Neumann, Tamara Kastorff, Xiaoming Zhai
TL;DR
This work addresses the challenge of automatically scoring German-written science responses by introducing G-SciEdBERT, a domain-specific, contextualized LLM built on G-BERT. The model is pre-trained on over 30,000 student responses from PISA 2018 and fine-tuned on 27–32 items from PISA datasets. To handle long inputs, it takes a [CLS] embedding from a fixed G-BERT encoder and passes it to a linear softmax classifier. Empirical results show a significant improvement over G-BERT: quadratic weighted Kappa rises by $0.1026$ on average (≈10.2%), performance stays robust on longer sentences and technical terminology, and item S131Q04 shows a notable 14.1% gain. The study demonstrates the value of domain-specific pre-training for automated scoring in education, provides strong evidence for cross-prompt applicability, and contributes open-source resources for reproducibility and broader adoption in educational assessment across languages and disciplines.
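To make the scoring head described above concrete, here is a minimal sketch of a linear softmax classifier applied to a single [CLS] embedding. It is an illustration under stated assumptions, not the authors' implementation: the function names `softmax` and `score_response`, the two-dimensional toy embedding, and the weight values are all hypothetical, and a real encoder would supply a much larger embedding vector.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_response(cls_embedding, weights, biases):
    """Apply a linear layer plus softmax to one [CLS] embedding.

    weights holds one row per score category; biases holds one value per
    category. Returns (predicted_score, probabilities). All names and
    shapes here are hypothetical, chosen only for illustration.
    """
    logits = [
        sum(w * x for w, x in zip(row, cls_embedding)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs


# Toy example: a 2-dimensional "embedding" and three score categories.
toy_embedding = [1.0, 0.0]
toy_weights = [[0.0, 0.0], [2.0, 0.0], [0.0, 5.0]]
toy_biases = [0.0, 0.0, 0.0]
predicted_score, probabilities = score_response(toy_embedding, toy_weights, toy_biases)
```

In this toy setup only the classifier parameters (`toy_weights`, `toy_biases`) would be learned during fine-tuning, since the encoder that produces the embedding is described as fixed.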
Abstract
The advancement of natural language processing has paved the way for automated scoring systems in various languages, such as German (e.g., German BERT [G-BERT]). Automatically scoring written responses to science questions in German is a complex task and challenging for standard G-BERT, because it lacks contextual knowledge in the science domain and may not align with student writing styles. This paper presents a contextualized German Science Education BERT (G-SciEdBERT), an innovative large language model tailored for scoring German-written responses to science tasks and beyond. Starting from G-BERT, we pre-trained G-SciEdBERT on a corpus of 30K German-written science responses with 3M tokens from the Programme for International Student Assessment (PISA) 2018. We fine-tuned G-SciEdBERT on an additional 20K student-written responses with 2M tokens and examined the scoring accuracy. We then compared its scoring performance with that of G-BERT. Our findings revealed a substantial improvement in scoring accuracy with G-SciEdBERT, demonstrating a 10.2% increase in quadratic weighted Kappa compared to G-BERT (mean difference = 0.1026, SD = 0.069). These insights underline the significance of specialized language models like G-SciEdBERT, which is trained to enhance the accuracy of contextualized automated scoring, offering a substantial contribution to the field of AI in education.
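For readers unfamiliar with the reported metric, the sketch below computes quadratic weighted Kappa between human and model scores using its standard definition: one minus the ratio of weighted observed disagreement to weighted chance disagreement, with weights $(i - j)^2 / (k - 1)^2$ for $k$ score categories. The function name `quadratic_weighted_kappa`, the sample scores, and the choice to return 1.0 when chance disagreement is zero are assumptions made for illustration; this is not the authors' evaluation code.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, num_categories):
    """Quadratic weighted Kappa for integer scores in [0, num_categories)."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if num_categories < 2:
        raise ValueError("need at least two score categories")

    n = len(human)
    k = num_categories
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    hist_human = Counter(human)            # marginal counts for human scores
    hist_model = Counter(model)            # marginal counts for model scores

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[(i, j)] / n
            weighted_expected += weight * hist_human[i] * hist_model[j] / (n * n)

    if weighted_expected == 0.0:
        # Assumption: both raters used one identical category throughout;
        # this sketch treats that as perfect agreement.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on a 0-2 scale.
perfect = quadratic_weighted_kappa([0, 1, 2, 1, 0], [0, 1, 2, 1, 0], 3)
reversed_scores = quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], 3)
```

Because Kappa is an agreement coefficient rather than a proportion of correct scores, the reported mean difference of 0.1026 is a difference between two such coefficients.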
