Privacy-Preserved Automated Scoring using Federated Learning for Educational Research
Ehsan Latif, Xiaoming Zhai
TL;DR
This work addresses privacy concerns in educational data by introducing a privacy-preserving federated learning framework for automated scoring that leverages parameter-efficient fine-tuning of large language models using LoRA. It adds an adaptive weighted aggregation scheme and differential privacy to securely combine locally trained updates across institutions, alongside a secure gRPC-based communication layer. Evaluated on NGSS-aligned, multi-label science tasks from nine middle schools, the approach achieves 94.5% accuracy in the federated setting, closely approaching centralized performance (within 0.5–1.0 percentage points) and delivering strong rubric-level scoring accuracy with a mean absolute error of 0.34. The results demonstrate that privacy-preserving FL can deliver high-quality, interpretable automated scoring at scale while moving less data and maintaining regulatory compliance, supported by open-source reproducibility.
Abstract
Data privacy remains a critical concern in educational research, requiring strict adherence to ethical standards and regulatory protocols. While traditional approaches rely on anonymization and centralized data collection, they often expose raw student data to security vulnerabilities and impose substantial logistical overhead. In this study, we propose a federated learning (FL) framework for automated scoring of educational assessments that eliminates the need to share sensitive data across institutions. Our approach leverages parameter-efficient fine-tuning of large language models (LLMs) with Low-Rank Adaptation (LoRA), enabling each client (school) to train locally while sharing only optimized model updates. To address data heterogeneity, we implement an adaptive weighted aggregation strategy that considers both client performance and data volume. We benchmark our model against two state-of-the-art FL methods and a centralized learning baseline using NGSS-aligned multi-label science assessment data from nine middle schools. Results show that our model achieves the highest accuracy (94.5%) among FL approaches and performs within 0.5–1.0 percentage points of the centralized model across the reported metrics. Additionally, it achieves comparable rubric-level scoring accuracy, with only a 1.3% difference in rubric match and a lower score deviation (MAE), highlighting its effectiveness in preserving both prediction quality and interpretability.
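
The adaptive weighted aggregation described above can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so the rule used here (weight proportional to local sample count times a local validation score) is an assumption made purely for illustration, as are the function names `adaptive_weights` and `aggregate` and the flat-list representation of the shared LoRA parameters.

```python
from typing import Dict, List


def adaptive_weights(num_samples: List[int], val_scores: List[float]) -> List[float]:
    """Return one normalized weight per client.

    ASSUMPTION: weight_k is proportional to n_k * score_k. The source only
    says the weights consider client performance and data volume.
    """
    if not num_samples or len(num_samples) != len(val_scores):
        raise ValueError("need one sample count and one score per client")
    raw = [n * s for n, s in zip(num_samples, val_scores)]
    total = sum(raw)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [r / total for r in raw]


def aggregate(
    updates: List[Dict[str, List[float]]], weights: List[float]
) -> Dict[str, List[float]]:
    """Weighted average of client updates, parameter by parameter.

    Each update maps a (hypothetical) parameter name to a flat list of
    values; all clients are assumed to share the same names and shapes.
    """
    if len(updates) != len(weights):
        raise ValueError("need one weight per client update")
    merged: Dict[str, List[float]] = {}
    for name, first in updates[0].items():
        merged[name] = [
            sum(w * u[name][i] for w, u in zip(weights, updates))
            for i in range(len(first))
        ]
    return merged


# Two hypothetical schools with equal validation scores but unequal data volume.
weights = adaptive_weights(num_samples=[100, 300], val_scores=[0.9, 0.9])
global_update = aggregate(
    [{"lora_A": [1.0, 2.0]}, {"lora_A": [3.0, 4.0]}],
    weights,
)
```

With equal validation scores the weights reduce to plain sample-count weighting, so under this assumed rule the performance term only shifts the average when clients differ in local quality.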
