BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation

Yun Wang; Xuansheng Wu; Jingyuan Huang; Lei Liu; Xiaoming Zhai; Ninghao Liu

TL;DR

The paper proposes BRIDGE, a Bias-Reducing Inter-group Data GEneration framework for low-resource assessment settings, as a cost-effective way to support equitable scoring in large-scale assessments.

Abstract

In the field of educational assessment, automated scoring systems increasingly rely on deep learning and large language models (LLMs). However, these systems face significant risks of bias amplification, where model prediction gaps between student groups become larger than those observed in training data. This issue is especially severe for underrepresented groups such as English Language Learners (ELLs), as models may inherit and further magnify existing disparities in the data. We identify that this issue is closely tied to representation bias: the scarcity of minority (high-scoring ELL) samples makes models trained with empirical risk minimization favor majority (non-ELL) linguistic patterns. Consequently, models tend to under-predict the scores of ELL students even when they demonstrate comparable domain knowledge but use different linguistic patterns, thereby undermining the fairness of automated scoring outcomes. To mitigate this, we propose BRIDGE, a Bias-Reducing Inter-group Data GEneration framework designed for low-resource assessment settings. Instead of relying on the limited minority samples, BRIDGE synthesizes high-scoring ELL samples by "pasting" construct-relevant content (i.e., rubric-aligned knowledge and evidence) from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns. We further introduce a discriminator model to ensure the quality of synthetic samples. Experiments on California Science Test (CAST) datasets demonstrate that BRIDGE effectively reduces prediction bias for high-scoring ELL students while maintaining overall scoring performance. Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.
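The abstract describes bias amplification as a prediction gap between student groups that exceeds the gap already present in the training data. The sketch below illustrates one way such a quantity could be computed. It assumes, purely for illustration, that the gap is the difference in mean score between the two groups; the function names, group labels, and toy numbers are hypothetical, and the paper's own formulation may differ.

```python
from statistics import mean

def group_gap(scores, groups, majority="non-ELL", minority="ELL"):
    """Mean score of the majority group minus mean score of the minority group."""
    majority_scores = [s for s, g in zip(scores, groups) if g == majority]
    minority_scores = [s for s, g in zip(scores, groups) if g == minority]
    return mean(majority_scores) - mean(minority_scores)

def bias_amplification(labels, predictions, groups):
    """Gap in model predictions minus gap in human labels (assumed definition).

    A positive value means the model widens the gap already in the data.
    """
    return group_gap(predictions, groups) - group_gap(labels, groups)

# Toy data (fabricated for illustration only, not from the paper).
groups = ["non-ELL", "non-ELL", "non-ELL", "ELL", "ELL"]
labels = [3, 2, 3, 3, 2]        # human-assigned scores
predictions = [3, 2, 3, 2, 2]   # model under-predicts one high-scoring ELL response

amplification = bias_amplification(labels, predictions, groups)
print(f"bias amplification: {amplification:.2f}")
```

In this toy case a single under-predicted high-scoring ELL response is enough to make the predicted gap larger than the labeled gap, which is the pattern the abstract attributes to representation bias.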

Paper Structure (17 sections, 10 equations, 1 figure, 1 table)

This paper contains 17 sections, 10 equations, 1 figure, 1 table.

Introduction
Bias Amplification in Automated Scoring
Problem Formulation
Representation Bias under ERM
From Representation Bias to Bias Amplification
Mitigating Bias Amplification with BRIDGE
Stage 1: Inter-group Stylistic Reformulation
Stage 2: Discriminative Filtering for Authenticity
Experiments
Datasets
Experimental Setup
Results and Analysis
Quantification of Bias Amplification (RQ1)
Comparative Analysis of Mitigation Strategies (RQ2)
Trade-off between Fairness and Scoring Performance (RQ3)
...and 2 more sections

Figures (1)

Figure 1: Overview of the bias amplification loop and the BRIDGE framework. (a) Bias Propagation: Representation bias in training data creates a feedback loop that reinforces educational disparities. (b) ERM Vulnerability: Under ERM, the decision boundary for high scores is skewed by the majority non-ELL group, causing systematic underprediction of the sparse ELL high-score subgroup. (c) BRIDGE Framework: Our approach mitigates this by performing inter-group stylistic reformulation, extracting construct-relevant content from high-scoring non-ELL samples and injecting ELL-specific linguistic patterns, followed by a discriminator for quality control.
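The two stages named in the caption (inter-group stylistic reformulation, then discriminative filtering) can be sketched as a generate-then-filter loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `bridge_augment`, the `reformulate` and `authenticity_score` callables, and the `threshold` parameter are hypothetical names, and the toy stand-ins below only exercise the control flow. In practice the reformulation step would presumably be an LLM call and the scorer a trained discriminator.

```python
from typing import Callable, List, Sequence

def bridge_augment(
    high_scoring_non_ell: Sequence[str],
    ell_style_examples: Sequence[str],
    reformulate: Callable[[str, Sequence[str]], str],
    authenticity_score: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Generate synthetic high-scoring ELL-style responses and filter them.

    Stage 1: rewrite each high-scoring non-ELL response in the style of the
             ELL reference examples, keeping its construct-relevant content.
    Stage 2: keep only candidates the discriminator scores at or above
             `threshold` (an assumed acceptance rule).
    """
    synthetic = []
    for response in high_scoring_non_ell:
        candidate = reformulate(response, ell_style_examples)
        if authenticity_score(candidate) >= threshold:
            synthetic.append(candidate)
    return synthetic

# Toy stand-ins (hypothetical): real components would be learned models.
def toy_reformulate(response: str, style_examples: Sequence[str]) -> str:
    return response.lower()

def toy_authenticity(candidate: str) -> float:
    return 1.0 if len(candidate.split()) >= 4 else 0.0

kept = bridge_augment(
    ["Plants grow because sunlight gives energy.", "The ice melts."],
    ["example ELL response used as a style reference"],
    toy_reformulate,
    toy_authenticity,
)
print(kept)
```

Passing the generator and discriminator as callables keeps the filtering logic independent of any particular model, so either component can be swapped without touching the loop.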
