Fairness Evaluation of Large Language Models in Academic Library Reference Services

Haining Wang, Jason Clark, Yueru Yan, Star Bradley, Ruiyang Chen, Yiqiong Zhang, Hengyi Fu, Zuoyu Tian

TL;DR

This study analyzes whether large language model (LLM)-mediated academic library reference services treat patrons equitably across sex, race/ethnicity, and institutional role. It introduces the Fairness Evaluation Protocol (FEP), a two-phase, model-agnostic audit using diagnostic classifiers on TF-IDF features across six state-of-the-art LLMs (three commercial and three open), with synthetic patron data crafted to reflect realistic library interactions. The results show demographic neutrality across race/ethnicity and sex (with a single minor exception in one model) and clear, role-based accommodation signals (formality, self-identification as librarians, and domain-specific vocabulary) aligned with professional norms rather than bias. The authors advocate using FEP as a recurring evaluation tool for equitable AI-enabled library services and propose future work to expand demographic scope, dialect variation, and real-world deployments beyond academia.
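The diagnostic-classifier idea behind FEP can be illustrated with a minimal sketch: if a classifier trained on TF-IDF features of the responses can predict a patron attribute better than chance, the responses carry a signal about that attribute. Everything specific here is an assumption made for illustration. The six responses and their role labels are invented, a leave-one-out nearest-centroid classifier stands in for the logistic regression, MLP, and XGBoost classifiers named in the paper, and the helper names (`tfidf_vectors`, `centroid`, `loo_accuracy`) are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document (smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def centroid(vectors):
    """Average a list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def dot(a, b):
    """Dot product of two sparse vectors."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def loo_accuracy(vectors, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Assumes every class has at least two examples, so a centroid
    can still be built when one example is held out.
    """
    correct = 0
    for i, (vec, label) in enumerate(zip(vectors, labels)):
        scores = {}
        for cls in sorted(set(labels)):
            members = [v for j, v in enumerate(vectors) if j != i and labels[j] == cls]
            scores[cls] = dot(vec, centroid(members))
        predicted = max(scores, key=scores.get)
        correct += predicted == label
    return correct / len(vectors)


# Invented toy responses, NOT data from the paper.
responses = [
    "dear professor databases support your research",
    "dear professor research consultations are offered",
    "dear professor your research guide is online",
    "hi there some tips for the paper",
    "hi there tips for citing sources",
    "hi there study rooms open today",
]
roles = ["faculty"] * 3 + ["undergraduate"] * 3

# Note: IDF is computed on all documents, including the held-out one.
# That is acceptable for a sketch but would be avoided in a real audit.
vectors = tfidf_vectors(responses)
accuracy = loo_accuracy(vectors, roles)
margin = accuracy - 1 / len(set(roles))  # accuracy minus chance level
print(f"accuracy={accuracy:.2f}, margin above chance={margin:.2f}")
```

The toy classes share no vocabulary, so the classifier separates them trivially. In a real audit the interesting question is whether accuracy on held-out responses exceeds chance by a statistically significant margin.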

Abstract

As libraries explore large language models (LLMs) for use in virtual reference services, a key question arises: Can LLMs serve all users equitably, regardless of demographics or social status? While they offer great potential for scalable support, LLMs may also reproduce societal biases embedded in their training data, risking the integrity of libraries' commitment to equitable service. To address this concern, we evaluate whether LLMs differentiate responses across user identities by prompting six state-of-the-art LLMs to assist patrons differing in sex, race/ethnicity, and institutional role. We find no evidence of differentiation by race or ethnicity, and only minor evidence of stereotypical bias against women in one model. LLMs demonstrate nuanced accommodation of institutional roles through the use of linguistic choices related to formality, politeness, and domain-specific vocabularies, reflecting professional norms rather than discriminatory treatment. These findings suggest that current LLMs show a promising degree of readiness to support equitable and contextually appropriate communication in academic library reference services.

Paper Structure

This paper contains 12 sections, 5 figures, 14 tables.

Figures (5)

  • Figure 1: Summary of classification performance across six LLMs and three demographic dimensions. Bars indicate classification margins above random chance for each diagnostic classifier: logistic regression (LogReg), MLP, and XGBoost. Margins are calculated as classification accuracy minus chance level, where chance levels are 16.7% for race/ethnicity (6 groups), 50.0% for sex (2 groups), and 16.7% for patron type (6 groups). Asterisks (*) denote statistically significant deviations from chance after Bonferroni correction ($\alpha = 0.0028$).
  • Figure 2: Volcano plot visualizing the contribution of individual words to sex classification in Llama-3.1 outputs. Each point represents a word-level feature. The $x$-axis shows the coefficient from a statistical logistic regression model, and the $y$-axis shows the $-\log_{10}(p)$ value. Dashed lines mark the Bonferroni-adjusted significance threshold and magnitude requirements. The term "dear" (marked in red) emerged as the only significant predictor after Bonferroni correction. Additional terms (e.g., "thank" and "research") with notable $p$-values or coefficient magnitudes that did not meet our dual criteria are also labelled.
  • Figure 3: Hierarchical clustering analysis of linguistic features discriminating patron types across LLMs. Salient words are clustered by similarity in discriminative patterns across models using Ward linkage. Dendrograms show feature relationships, with heatmaps displaying normalised coefficient magnitudes for consensus features (significant in $\geq$2 models). Darker colours indicate stronger discriminative power.
  • Figure 4: Cross-LLM variation in patron type accommodation for three key linguistic features. Each radar plot displays individual LLM coefficients for "thank", "dear", and "research" across patron types, relative to undergraduate students. Positive values indicate higher likelihood of feature usage; negative values indicate reduced usage.
  • Figure 5: Volcano plots for sex classification at temperatures 0.0 and 0.3. Each plot shows feature importance (x-axis: log-odds coefficients) versus statistical significance (y-axis: $-\log_{10}(p)$). The salutation "dear" remains the only statistically significant feature after correction across both temperature settings, consistent with findings at $T=0.7$.
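
The margin and significance threshold described in the Figure 1 caption can be worked through in a short sketch. Several things here are assumptions rather than details confirmed by the captions: the threshold 0.0028 is read as a family-wise level of 0.05 divided by 18 comparisons (six LLMs times three dimensions), the test against chance is taken to be a one-sided exact binomial test, and the test-set size and accuracies are invented. The function names are hypothetical.

```python
from math import exp, lgamma, log

N_LLMS = 6
N_DIMENSIONS = 3
FAMILY_ALPHA = 0.05  # assumed family-wise error rate

# Assumed reading of the caption: one comparison per LLM and dimension.
n_comparisons = N_LLMS * N_DIMENSIONS
bonferroni_alpha = FAMILY_ALPHA / n_comparisons

# Chance level is 1 / number of groups, as stated in the caption.
n_groups = {"race/ethnicity": 6, "sex": 2, "patron type": 6}
chance = {dim: 1 / k for dim, k in n_groups.items()}


def margin_above_chance(accuracy, groups):
    """Classification margin: accuracy minus chance level."""
    return accuracy - 1 / groups


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p)
        )
        total += exp(log_pmf)
    return total


# Invented example: 1000 held-out responses for the sex dimension.
n_test = 1000
p_strong = binomial_upper_tail(550, n_test, chance["sex"])  # 55.0% accuracy
p_weak = binomial_upper_tail(520, n_test, chance["sex"])    # 52.0% accuracy

print(f"Bonferroni alpha: {bonferroni_alpha:.4f}")
print(f"margin at 55% accuracy: {margin_above_chance(0.55, 2):.3f}")
print(f"55% significant: {p_strong < bonferroni_alpha}")
print(f"52% significant: {p_weak < bonferroni_alpha}")
```

Under these assumptions, a five-point margin on 1000 responses clears the corrected threshold while a two-point margin does not. A small margin alone therefore says little without the test-set size.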