Leveraging Small LLMs for Argument Mining in Education: Argument Component Identification, Classification, and Assessment

Lucile Favero, Juan Antonio Pérez-Ortiz, Tanja Käser, Nuria Oliver

TL;DR

The paper investigates using small open-source LLMs to perform a full argument mining pipeline (segmentation, argument type classification, and argument quality assessment) on student essays in education, with a focus on local, privacy-preserving deployment. It compares few-shot prompting and fine-tuning across three models (Qwen 2.5 7B, Llama 3.1 8B, Gemma 2 9B) against encoder baselines and GPT-4o mini on the Feedback Prize dataset. Results show that fine-tuned small LLMs outperform state-of-the-art encoders in segmentation and type classification, while few-shot prompting yields competitive results for quality assessment; joint modeling yields additional gains. The work demonstrates the practicality and privacy-preserving potential of open-source LLMs for real-time, personalized feedback on student writing, enabling scalable education tools on local devices.

Abstract

Argument mining algorithms analyze the argumentative structure of essays, making them a valuable tool for enhancing education by providing targeted feedback on students' argumentation skills. While current methods often use encoder or encoder-decoder deep learning architectures, decoder-only models remain largely unexplored, offering a promising research direction. This paper proposes leveraging open-source, small Large Language Models (LLMs) for argument mining through few-shot prompting and fine-tuning. These models' small size and open-source nature ensure accessibility, privacy, and computational efficiency, enabling schools and educators to adopt and deploy them locally. Specifically, we perform three tasks: segmentation of student essays into arguments, classification of the arguments by type, and assessment of their quality. We empirically evaluate the models on the Feedback Prize - Predicting Effective Arguments dataset of essays written by students in grades 6-12 and demonstrate that fine-tuned small LLMs outperform baseline methods in segmenting the essays and determining the argument types, while few-shot prompting yields performance comparable to that of the baselines in assessing quality. This work highlights the educational potential of small, open-source LLMs to provide real-time, personalized feedback, enhancing independent learning and writing skills while ensuring low computational cost and privacy.

Paper Structure

This paper contains 50 sections, 1 equation, 5 figures, 16 tables.

Figures (5)

  • Figure 1: Overview of the proposed framework. Given an essay as input, the objective is to first segment it into arguments, then classify the argument types, and assess their quality using small open-source LLMs (Qwen 2.5 7B, Llama 3.1 8B, and Gemma 2 9B). These tasks are performed either individually or jointly through two learning approaches: few-shot prompting or fine-tuning.
  • Figure 2: Macro-averaged F1 scores [%] for the argument segmentation task across models. Comparison of small open-source models (Qwen 2.5 7B, Llama 3.1 8B, Gemma 2 9B) in the best few-shot (zero or three-shot) and fine-tuned (ft) settings with the baseline (Longformer) and GPT-4o mini, both with three-shot and fine-tuned. Error bars depict the standard deviation.
  • Figure 3: Macro-averaged F1 scores [%] for the argument type classification (left) and quality assessment (right) across models. Comparison of three small open-source models (Qwen 2.5 7B, Llama 3.1 8B, Gemma 2 9B) in the best few-shot (zero or three-shot) and fine-tuned (ft) settings with the baseline and GPT-4o mini (few-shot and fine-tuned). The results highlighted in transparent colors correspond to the evaluation with the gold segmentation whereas the darker colors correspond to inferred segmentation. In the case of gold segmentation, the baseline corresponds to a BERT model with two prediction heads. In the case of inferred segmentation, the segmentation is carried out by a Longformer followed by a classification with BERT. Circles represent the joint setup (both type and quality classification are performed at the same time) whereas triangles correspond to the individual setup (type and quality classification are performed separately). Error bars show the standard deviation.
  • Figure 4: Overlap, in %, with the gold segmentation and predicted segmentation across models. Comparison of small open-source models (Llama 3.1 8B, Qwen 2.5 7B, Gemma 2 9B) in the few-shot and fine-tuned (ft) settings with the baseline (Longformer) and GPT-4o mini few-shot and fine-tuned for the joint setup. Error bars correspond to the standard deviation.
  • Figure 5: Average number of arguments. Comparison of small open-source models (Qwen 2.5 7B, Llama 3.1 8B, and Gemma 2 9B) in few-shot and fine-tuned (ft) settings with the baseline (Longformer) and GPT-4o mini few-shot and fine-tuned for the joint setup. Error bars show the standard deviation.
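
Figures 2 and 3 report macro-averaged F1 scores. As a reminder of what that metric measures, the following is a minimal sketch of the standard definition: F1 is computed separately for each class and the per-class scores are averaged with equal weight. The function name `macro_f1` and the example labels are illustrative assumptions and are not taken from the paper's code; the paper's exact evaluation protocol (for example, how predicted segments are matched to gold segments) is not described in this section and is not reproduced here.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores.

    Classes are taken from the union of gold and predicted labels.
    A class with no true positives, false positives, or false negatives
    cannot occur here, since every class in the union appears at least once.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one labeled item is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1  # correct prediction for class g
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the true class but was missed

    labels = sorted(set(gold) | set(pred))
    per_class = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN)
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Illustrative quality labels (hypothetical example data).
gold = ["Effective", "Adequate", "Adequate", "Ineffective"]
pred = ["Effective", "Adequate", "Ineffective", "Ineffective"]
print(f"macro-F1 = {100 * macro_f1(gold, pred):.1f}%")
```

Because every class contributes equally to the average, a model that performs poorly on a rare class is penalized as much as one that performs poorly on a frequent class, which is why macro averaging is a common choice when label distributions are imbalanced.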