Responsible Intelligence in Practice: A Fairness Audit of Open Large Language Models for Library Reference Services

Haining Wang, Jason Clark, Angelica Peña

TL;DR

This chapter applies a systematic evaluation approach that pairs diagnostic classification, used to detect systematic differences in LLM responses, with linguistic analysis, used to interpret the sources of those differences. It also discusses implications for responsible AI adoption in libraries and the importance of ongoing monitoring in aligning LLM-based services with core professional values.

Abstract

As libraries explore large language models (LLMs) as a scalable layer for reference services, a core fairness question follows: can LLM-based services support all patrons fairly, regardless of demographic identity? While LLMs offer great potential for broadening access to information assistance, they may also reproduce societal biases embedded in their training data, potentially undermining libraries' commitments to impartial service. In this chapter, we apply a systematic evaluation approach that combines diagnostic classification to detect systematic differences with linguistic analysis to interpret their sources. Across three widely used open models (Llama-3.1 8B, Gemma-2 9B, and Ministral 8B), we find no compelling evidence of systematic differentiation by race/ethnicity, and only minor evidence of sex-linked differentiation in one model. We discuss implications for responsible AI adoption in libraries and the importance of ongoing monitoring in aligning LLM-based services with core professional values.

Paper Structure

This paper contains 22 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Illustrative workflow of the study. (a) Simulation and corpus construction: we synthesize reference emails by combining query templates (academic and public) with demographic cues encoded in names, then generate responses from open LLMs under a librarian persona across multiple random seeds. (b) Fairness Evaluation Protocol (FEP): we extract TF-IDF features from responses and test whether demographic attributes can be predicted above chance using diagnostic classifiers (LogReg, MLP, XGBoost). If classification is not significant, we interpret this as no detectable systematic differentiation under the tested conditions; if significant, we fit a statistical logit model to identify salient lexical markers that drive the difference.
  • Figure 2: Summary of classification performance across three open LLMs in academic and public library settings for two demographic dimensions. Bars indicate classification margins above random chance for each diagnostic classifier: logistic regression (LogReg), MLP, and XGBoost. Margins are calculated as classification accuracy minus chance level, where chance levels are 50.0% for sex (2 groups) and 16.7% for race/ethnicity (6 groups). Asterisks (*) denote statistically significant deviations from chance after Bonferroni correction ($\alpha = 0.0056$ within each setting and demographic dimension).
  • Figure 3: Volcano plots visualizing the contribution of individual words to sex classification in Llama-3.1 outputs. Each point represents a word-level feature. The $x$-axis shows the coefficient from a statistical logistic regression model, and the $y$-axis shows the $-\log_{10}(p)$ value. Dashed lines mark the Bonferroni-adjusted significance threshold and the magnitude requirement ($|\beta| \ge \log(2)$). Left panel: academic library setting. Right panel: public library setting. The term "dear" (marked in red in the left panel) emerged as a significant predictor in the academic setting but not in the public setting.
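
The arithmetic behind the figure captions can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. It shows how the chance levels (50.0% and 16.7%), the classification margin, and the threshold $\alpha = 0.0056$ fit together, and how the two volcano-plot criteria combine. Several points are assumptions: the value 0.0056 is consistent with dividing 0.05 by nine comparisons (three models by three classifiers), which is an inference from the caption rather than something the source states; the source does not name the significance test, so an exact one-sided binomial test is used here as a stand-in; and all function names and example numbers are hypothetical.

```python
import math


def chance_level(n_groups: int) -> float:
    """Accuracy of random guessing over equally sized groups."""
    return 1.0 / n_groups


def margin(accuracy: float, n_groups: int) -> float:
    """Classification margin: accuracy minus chance level."""
    return accuracy - chance_level(n_groups)


def bonferroni_alpha(family_alpha: float, n_tests: int) -> float:
    """Per-test threshold after Bonferroni correction."""
    return family_alpha / n_tests


def binomial_p_value(n_correct: int, n_total: int, p_chance: float) -> float:
    """One-sided exact binomial test: P(X >= n_correct) under chance.

    ASSUMPTION: the source does not say which test it uses; this is
    only one reasonable choice for testing accuracy against chance.
    """
    return sum(
        math.comb(n_total, k) * p_chance**k * (1.0 - p_chance) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


def salient_words(features: dict, alpha: float) -> list:
    """Words that clear both volcano-plot lines.

    `features` maps word -> (beta, p_value). A word is kept when
    p < alpha and |beta| >= log(2), i.e. the odds at least double or halve.
    """
    min_beta = math.log(2)
    return sorted(
        word
        for word, (beta, p) in features.items()
        if p < alpha and abs(beta) >= min_beta
    )


if __name__ == "__main__":
    # Chance levels quoted in the Figure 2 caption.
    print(round(100 * chance_level(2), 1), round(100 * chance_level(6), 1))

    # ASSUMPTION: 9 tests = 3 models x 3 classifiers.
    alpha = bonferroni_alpha(0.05, 9)
    print(round(alpha, 4))  # 0.0056

    # Hypothetical classifier result: 120 of 200 held-out responses correct.
    p = binomial_p_value(120, 200, chance_level(2))
    print("margin:", margin(120 / 200, 2), "significant:", p < alpha)

    # Hypothetical word-level coefficients, NOT values from the paper.
    toy = {"word_a": (0.9, 1e-6), "word_b": (0.1, 1e-8), "word_c": (1.2, 0.2)}
    print(salient_words(toy, bonferroni_alpha(0.05, len(toy))))
```

Requiring both a small $p$-value and $|\beta| \ge \log(2)$ keeps words whose effect is statistically reliable and also large enough to matter, which is why a word can have a very small $p$-value in a volcano plot and still fall outside the flagged region.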