Classifying German Language Proficiency Levels Using Large Language Models

Elias-Leander Ahlers, Witold Brunsmann, Malte Schilling

TL;DR

This work assesses how Large Language Models can classify German texts into CEFR proficiency levels. By constructing a balanced, augmented dataset and evaluating prompting, fine-tuning, and probing strategies, it demonstrates substantial performance gains over traditional methods. The fine-tuned LLaMA-3-8B-Instruct model achieves a weighted F1 of 0.769 and perfect group accuracy, while probing the model’s internal states provides additional benefits. The study highlights the practical potential of LLM-based CEFR assessment and suggests avenues for synthetic data generation to expand labeled resources.

Abstract

Assessing language proficiency is essential for education, as it enables instruction tailored to learners' needs. This paper investigates the use of Large Language Models (LLMs) for automatically classifying German texts into proficiency levels according to the Common European Framework of Reference for Languages (CEFR). To support robust training and evaluation, we construct a diverse dataset by combining multiple existing CEFR-annotated corpora with synthetic data. We then evaluate prompt-engineering strategies, fine-tuning of a LLaMA-3-8B-Instruct model, and a probing-based approach that utilizes the internal neural state of the LLM for classification. Our results show a consistent performance improvement over prior methods, highlighting the potential of LLMs for reliable and scalable CEFR classification.

Paper Structure

This paper contains 22 sections, 5 figures, and 10 tables.

Figures (5)

  • Figure 1: Confusion matrices for different prompt engineering approaches using the LLaMA-3-8B-Instruct model: (a) English Base Prompt, (b) German Zero-Shot Prompt, (c) German Few-Shot Prompt. Each matrix visualizes predicted CEFR levels (columns) against true labels (rows), with cell shading indicating prediction density. The mean classification distances are: English Base Prompt = $1.120$, German Zero-Shot Prompt = $1.051$, German Few-Shot Prompt = $0.467$.
  • Figure 2: Performance comparison of different language models on CEFR classification, showing both exact accuracy and group accuracy (which also counts adjacent levels as correct), sorted by exact accuracy. Model names are given as they appear on Hugging Face.
  • Figure 3: Confusion matrix for the neural-network-based classifier, highlighting reduced confusion between adjacent CEFR levels.
  • Figure 4: Confusion matrix for the fine-tuned LLaMA-3-8B model, showing improved accuracy and reduced confusion between neighboring CEFR levels.
  • Figure 5: Training loss (blue), validation loss (orange), and text accuracy (red) during training of the model. The vertical line marks the training cutoff used.
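
The figure captions refer to exact accuracy, group accuracy, and mean classification distance without defining them here. The following is a minimal sketch of how such metrics could be computed. It assumes that CEFR levels are mapped to the ordinal indices 0 to 5 (A1 to C2), that group accuracy counts a prediction as correct when it lies within one level of the true label, and that the mean classification distance is the mean absolute difference between predicted and true level indices. These definitions and all function names are illustrative assumptions and may differ from those used in the paper.

```python
from typing import List, Sequence

# Assumed ordinal ordering of CEFR levels (illustrative; the paper's label set may differ).
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
LEVEL_INDEX = {level: i for i, level in enumerate(CEFR_LEVELS)}


def level_distances(true_labels: Sequence[str], predicted_labels: Sequence[str]) -> List[int]:
    """Absolute distance in CEFR steps between each true and predicted label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("true_labels and predicted_labels must have the same length")
    if not true_labels:
        raise ValueError("at least one label pair is required")
    return [abs(LEVEL_INDEX[t] - LEVEL_INDEX[p]) for t, p in zip(true_labels, predicted_labels)]


def exact_accuracy(true_labels: Sequence[str], predicted_labels: Sequence[str]) -> float:
    """Fraction of predictions that match the true level exactly."""
    distances = level_distances(true_labels, predicted_labels)
    return sum(d == 0 for d in distances) / len(distances)


def group_accuracy(true_labels: Sequence[str], predicted_labels: Sequence[str], tolerance: int = 1) -> float:
    """Fraction of predictions within `tolerance` levels of the true level (assumed definition)."""
    distances = level_distances(true_labels, predicted_labels)
    return sum(d <= tolerance for d in distances) / len(distances)


def mean_classification_distance(true_labels: Sequence[str], predicted_labels: Sequence[str]) -> float:
    """Mean absolute distance in CEFR steps between predictions and true labels (assumed definition)."""
    distances = level_distances(true_labels, predicted_labels)
    return sum(distances) / len(distances)


if __name__ == "__main__":
    # Made-up labels for illustration only; not data from the paper.
    y_true = ["A1", "A2", "B1", "B2", "C1", "C2"]
    y_pred = ["A1", "B1", "B1", "C2", "C1", "C1"]
    print(exact_accuracy(y_true, y_pred))
    print(group_accuracy(y_true, y_pred))
    print(mean_classification_distance(y_true, y_pred))
```

Under these assumed definitions, a lower mean classification distance means that misclassifications land closer to the true level, which is one way to read the drop from $1.120$ to $0.467$ reported for the prompts in Figure 1.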