G-SciEdBERT: A Contextualized LLM for Science Assessment Tasks in German

Ehsan Latif; Gyeong-Geon Lee; Knut Neumann; Tamara Kastorff; Xiaoming Zhai

TL;DR

This work addresses the challenge of automatically scoring German-written science responses by introducing G-SciEdBERT, a domain-specific, contextualized LLM built on G-BERT. The model is pre-trained on over 30,000 student responses from PISA 2018 and fine-tuned on 27–32 items from PISA datasets. It takes a [CLS] embedding derived from a fixed G-BERT encoder to handle long inputs and passes it to a linear softmax classifier. Empirical results show a significant improvement over G-BERT: quadratic weighted Kappa increases by $0.1026$ on average (≈10.2%), performance stays robust for longer sentences and technical terminology, and item S131Q04 shows a notable 14.1% gain. The study demonstrates the value of domain-specific pre-training for automated scoring in education, provides strong evidence for cross-prompt applicability, and contributes open-source resources for reproducibility and broader adoption in educational assessment across languages and disciplines.
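The scoring head described above (a pooled [CLS] embedding followed by a linear softmax classifier) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 4-dimensional embedding, and the 3 score classes are assumptions chosen for illustration, and the encoder that would produce the embedding is left out entirely.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities, shifting by the max for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_response(cls_embedding, weights, bias):
    """Apply a linear layer and softmax to one pooled embedding.

    cls_embedding: list of floats (hypothetical encoder output for one response)
    weights: one row of floats per score class
    bias: one float per score class
    Returns (predicted_class_index, class_probabilities).
    """
    logits = [
        sum(w * x for w, x in zip(row, cls_embedding)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return predicted, probs


# Toy values, purely illustrative: a 4-dimensional embedding and 3 score classes.
embedding = [1.0, 0.0, -1.0, 0.5]
weights = [
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, -1.0, 0.0],
    [0.0, 0.0, 0.0, 0.2],
]
bias = [0.0, 0.0, 0.0]

predicted, probs = score_response(embedding, weights, bias)
print(predicted, [round(p, 3) for p in probs])
```

In a real system the embedding would have the encoder's hidden size, and the weights and bias would be learned during fine-tuning rather than set by hand.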

Abstract

The advancement of natural language processing has paved the way for automated scoring systems in various languages, such as German (e.g., German BERT [G-BERT]). Automatically scoring written responses to science questions in German is a complex task and is challenging for standard G-BERT, as it lacks contextual knowledge in the science domain and may be unaligned with student writing styles. This paper presents a contextualized German Science Education BERT (G-SciEdBERT), an innovative large language model tailored for scoring German-written responses to science tasks and beyond. Using G-BERT, we pre-trained G-SciEdBERT on a corpus of 30K German-written science responses with 3M tokens from the Programme for International Student Assessment (PISA) 2018. We fine-tuned G-SciEdBERT on an additional 20K student-written responses with 2M tokens and examined the scoring accuracy. We then compared its scoring performance with G-BERT. Our findings revealed a substantial improvement in scoring accuracy with G-SciEdBERT, demonstrating a 10.2% increase in quadratic weighted Kappa compared to G-BERT (mean difference = 0.1026, SD = 0.069). These insights underline the significance of specialized language models like G-SciEdBERT, which is trained to enhance the accuracy of contextualized automated scoring, offering a substantial contribution to the field of AI in education.
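For readers unfamiliar with the reported metric, the sketch below computes quadratic weighted Kappa from its standard definition, $\kappa = 1 - \frac{\sum_{i,j} w_{ij} O_{ij}}{\sum_{i,j} w_{ij} E_{ij}}$ with $w_{ij} = (i-j)^2/(K-1)^2$, where $O$ is the observed human-versus-model score matrix, $E$ the matrix expected by chance from the two score distributions, and $K$ the number of score levels. The function name and toy scores are assumptions for illustration and are not taken from the paper; returning 1.0 when both raters use a single identical score is a convention chosen here, not a universal one.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, num_classes):
    """Quadratic weighted Kappa for integer scores in range(num_classes)."""
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")

    n = len(human)
    observed = Counter(zip(human, model))  # counts of (human, model) score pairs
    human_hist = Counter(human)            # marginal counts of human scores
    model_hist = Counter(model)            # marginal counts of model scores

    disagreement = 0.0  # weighted observed disagreement
    chance = 0.0        # weighted disagreement expected by chance
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            disagreement += weight * observed[(i, j)]
            chance += weight * human_hist[i] * model_hist[j] / n

    if chance == 0.0:
        # Both raters gave every response the same single score.
        return 1.0
    return 1.0 - disagreement / chance


# Toy scores, purely illustrative.
human_scores = [0, 0, 1, 1]
print(quadratic_weighted_kappa(human_scores, [0, 0, 1, 1], 2))
print(quadratic_weighted_kappa(human_scores, [0, 1, 0, 1], 2))
print(quadratic_weighted_kappa(human_scores, [1, 1, 0, 0], 2))
```

The three toy calls cover perfect agreement, chance-level agreement, and complete disagreement, which correspond to Kappa values at the top, middle, and bottom of the metric's range.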

Paper Structure (15 sections, 1 equation, 2 figures, 2 tables)


Introduction
Background
Contextualized Large Language Models
Cross-Prompt Automatic Scoring
Method
Problem Formulation
Dataset Details
Our Approach: G-SciEdBERT
Experimental Evaluation
Metrics and Baseline
Implementation Details
Results
Discussion
Conclusion
Accuracy Results for all 27 items

Figures (2)

Figure 1: System Architecture: G-SciEdBERT pretraining and fine-tuning to score German written responses automatically.
Figure 2: Visualization of the effect of item-wise response length and the count of scientific words in student-written responses on G-BERT and G-SciEdBERT accuracy (comparison plots for all 27 items). The flatter trend lines for G-SciEdBERT indicate that average sentence length and average number of scientific words have a reduced effect on its accuracy.
