LLMs are Biased Teachers: Evaluating LLM Bias in Personalized Education

Iain Weissburg, Sathvika Anand, Sharon Levy, Haewon Jeong

TL;DR

This work evaluates how large language models behave as personalized teachers, focusing on bias across demographic groups in both the selection and the generation of educational content. It introduces two bias metrics, Mean Absolute Bias ($MAB$) and Maximum Difference Bias ($MDB$), and applies them to nine frontier LLMs using over 17,000 explanations drawn from multiple datasets and topics. The findings show pervasive bias related to race/ethnicity, sex/gender, disability, income, and other attributes, with the highest bias for income and disability and the lowest for sex/gender and race/ethnicity. These biases persist across teacher and student roles and across ranking and generative tasks. The paper highlights practical implications for education technology, ethical considerations, and the need for fairness-aware design and mitigation strategies when deploying LLM-based tutors.

Abstract

With the increasing adoption of large language models (LLMs) in education, concerns about inherent biases in these models have gained prominence. We evaluate LLMs for bias in the personalized educational setting, specifically focusing on the models' roles as "teachers." We reveal significant biases in how models generate and select educational content tailored to different demographic groups, including race, ethnicity, sex, gender, disability status, income, and national origin. We introduce and apply two bias score metrics--Mean Absolute Bias (MAB) and Maximum Difference Bias (MDB)--to analyze 9 open and closed state-of-the-art LLMs. Our experiments, which utilize over 17,000 educational explanations across multiple difficulty levels and topics, uncover that models potentially harm student learning by both perpetuating harmful stereotypes and reversing them. We find that bias is similar for all frontier models, with the highest MAB along income levels while MDB is highest relative to both income and disability status. For both metrics, we find the lowest bias exists for sex/gender and race/ethnicity.
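
The abstract names the two metrics but does not define them here. As a rough illustration only, the sketch below assumes each demographic group in a subgroup has a single normalized score (0 meaning no bias), takes MAB as the mean of the absolute scores, and takes MDB as the gap between the largest and smallest score. These forms are inferred from the metric names and are assumptions; the paper's own definitions in its metrics section may differ. The function names and the example scores are hypothetical.

```python
from statistics import mean


def mean_absolute_bias(scores):
    """Assumed form of MAB: average magnitude of the per-group scores."""
    return mean(abs(s) for s in scores.values())


def maximum_difference_bias(scores):
    """Assumed form of MDB: gap between the highest and lowest group score."""
    values = list(scores.values())
    return max(values) - min(values)


# Hypothetical normalized scores for one subgroup (not taken from the paper).
income_scores = {"low income": -0.5, "middle income": 0.0, "high income": 0.25}

if __name__ == "__main__":
    print("MAB:", mean_absolute_bias(income_scores))
    print("MDB:", maximum_difference_bias(income_scores))
```

With these made-up scores, the assumed MAB is 0.25 and the assumed MDB is 0.75. Under these assumed forms, MDB reacts to a single outlying group while MAB averages over all groups, which would be consistent with the two metrics ranking subgroups differently.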

Paper Structure

This paper contains 61 sections, 8 equations, 55 figures, 4 tables.

Figures (55)

  • Figure 1: A diagram of the ranking experiment setting.
  • Figure 2: Normalized scores (Eq. \ref{eq:z}) and 95% bootstrapping CI for the ranking and generative tasks on GPT 4o; closer to 0 is better. For both, we observe significant bias across all subgroups. For (a), stereotypical biases occur in Sex/Gender and reverse biases in Disability. For (b), we observe the opposite, with stereotypical biases in Disability and reverse biases in Sex/Gender. In plot (a), see Section \ref{sec:rq1} for insight into the Reference patterns. For (b), strong ordering in our Reference characteristics (bottom) indicates alignment between the model outputs and scoring strategy.
  • Figure 3: Bias scores for each demographic subgroup as described in Section \ref{sec:metrics}; lower is better. (a) and (b) are for each model, averaged across tasks/datasets. (c) and (d) are averaged across models for each task/dataset ("Generative" is the generative task, and others are the ranking datasets). For all primary models, the Income subgroup shows the highest average bias. When we consider the maximum difference bias, Disability and Religion also show very high bias. Note that this only shows the magnitude of bias, which can be stereotypical, reverse, or mixed. See Appendix \ref{app:models} for a list of every experiment.
  • Figure 4: The Mean Choice Value (MCV) and 95% bootstrapping CI for GPT 4o comparing the student and teacher roles using selected subgroups that show the most bias patterns; closer to 0 is better. We observe that the bias trends in the student role tend to mirror those in the teacher role, although there is some variation.
  • Figure 5: The normalized scores (Eq. \ref{eq:z}) and 95% bootstrapping CI of GPT 4o for MATH and topic modeling, selected subgroups of interest; closer to 0 is better. In (b), we observe that the subject of the articles does not seem to have much of an effect on the models' choices. Most variation falls within the error bars, and trends follow the overall News In Levels experiments. In (a), we observe similar bias patterns across most categories despite the lack of linguistic features. In both figures, we observe similar instances of reverse bias (e.g., Disability).
  • ...and 50 more figures
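
The captions above report 95% bootstrapping confidence intervals without saying how they are computed. One common choice is the percentile bootstrap of the mean, sketched below under that assumption; the paper may use a different variant. The function name, the resample count, the fixed seed, and the sample values are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean (an assumed procedure)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    # Read off the lower and upper percentiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-item choice values for one demographic group.
choice_values = [-1, 0, 1, 0, -1, 1, 0, 0, 1, -1]

if __name__ == "__main__":
    low, high = bootstrap_ci(choice_values)
    print(f"95% CI for the mean: [{low:.3f}, {high:.3f}]")
```

An interval that excludes 0 would correspond to the "significant bias" reading used in the captions, on the assumption that 0 represents the unbiased baseline.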