Whose Name Comes Up? Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation

Lisette Espin-Noboa, Gonzalo Gabriel Mendez

TL;DR

LLMScholarBench is a reproducible benchmark for auditing LLM-based scholar recommendation that jointly considers model infrastructure and inference-time interventions. The study evaluates 22 LLMs on physics tasks under three end-user controls (temperature, representation-constrained prompting, and RAG with web search) and finds systematic trade-offs rather than universal improvements. Larger, proprietary, and reasoning-enabled models tend to boost factuality but may reduce validity and diversity, while inference-time controls predominantly redistribute errors across technical and representational dimensions. Grounded in APS/OpenAlex data, the benchmark provides standardized metrics capturing both technical quality and social representation, including coauthorship connectivity, bibliometric similarity, and parity across perceived demographic attributes. The findings emphasize that deployment choices shape socio-technical outcomes and that no single configuration universally excels, underscoring the need for auditable, modular pipelines and explicit representation goals in scholarly recommendation systems. The authors release open-source code and data to facilitate cross-domain adaptation and further methodological development in audit benchmarks.

Abstract

Large language models (LLMs) are increasingly used for academic expert recommendation. Existing audits typically evaluate model outputs in isolation, largely ignoring end-user inference-time interventions. As a result, it remains unclear whether failures such as refusals, hallucinations, and uneven coverage stem from model choice or deployment decisions. We introduce LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. LLMScholarBench measures both technical quality and social representation using nine metrics. We instantiate the benchmark in physics expert recommendation and audit 22 LLMs under temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. Our results show that end-user interventions do not yield uniform improvements but instead redistribute error across dimensions. Higher temperature degrades validity, consistency, and factuality. Representation-constrained prompting improves diversity at the expense of factuality, whereas RAG primarily improves technical quality but reduces diversity and parity. Overall, end-user interventions reshape trade-offs rather than providing a general fix. We release code and data that can be adapted to other disciplines by replacing domain-specific ground truth and metrics.

Paper Structure (27 sections, 12 equations, 22 figures, 3 tables)

This paper contains 27 sections, 12 equations, 22 figures, 3 tables.

Figures (22)

  • Figure 1: Infrastructure-level performance. Mean values ($\pm95\%$ CI) aggregated by model access, model size, and reasoning capability. Bold values indicate best-in-group performance for metrics with a clear directional preference (arrows indicate whether higher or lower is better). The results show clear trade-offs across infrastructure groups, indicating that access, scale, and reasoning design favor different outcomes depending on the evaluation criterion.
  • Figure 2: Effect of temperature on performance. Mean values ($\pm95\%$ CI) across sampling temperatures, aggregated by model access, model size, and reasoning capability. Higher temperatures generally reduce most technical metrics, with pronounced thresholds in outcomes such as validity, indicating that temperature amplifies trade-offs across infrastructure groups. Proprietary models show lower sensitivity to temperature variation and more stable metric trends than other infrastructure groups.
  • Figure 3: Effects of gender-constrained prompting on top-100 expert recommendation lists (averaged across all models). Each panel shows the mean metric value ($\pm95\%$ CI) before (B) and after (A) applying the constraint. Enforcing balanced gender representation mainly increases gender diversity with little change in gender parity, but reduces factuality and similarity. Female-only prompts produce the lowest factuality, similarity, and gender parity, while yielding the highest ethnicity diversity.
  • Figure 4: Effect of RAG web search on performance across tasks for Gemini models. Panels show mean metric values ($\pm95\%$ CI) before (B) and after (A) enabling RAG. Flash (top row) shows a larger drop in validity under RAG across most tasks, whereas Pro (bottom row) is comparatively less affected. Duplicates remain near zero for both models, factuality stays high, and changes in connectedness, similarity, and representation metrics (diversity/parity) are smaller and more task-dependent.
  • Figure A.1: Prompt template. The template specifies the task, step-by-step instructions, and a structured JSON output format. The criteria field is instantiated according to the task scenario (e.g., top-$k$, field, epoch, or seniority). The backup_indicator explicitly requests task-dependent attributes to be returned for each recommended scholar, which are later used to assess factual accuracy. The output_example illustrates the expected JSON structure corresponding to the requested indicators.
  • ...and 17 more figures
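
The figure captions above report mean values with a $\pm95\%$ CI, before (B) and after (A) an intervention. The sketch below shows one common way such an interval can be computed. It assumes a normal-approximation interval (1.96 standard errors around the mean); the captions do not say how the authors computed theirs. The function name `mean_ci95` and the sample scores are hypothetical and are not taken from the paper.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, half_width) of a normal-approximation 95% CI.

    Assumption: interval = mean +/- 1.96 * standard error of the mean.
    This is an illustrative choice, not necessarily the authors' method.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(values)
    # Sample standard deviation (n - 1 denominator) divided by sqrt(n).
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, 1.96 * sem


# Hypothetical per-run metric scores before (B) and after (A) an intervention.
before = [0.80, 0.90, 0.85, 0.95]
after = [0.70, 0.75, 0.80, 0.75]

for label, scores in (("B", before), ("A", after)):
    mean, half_width = mean_ci95(scores)
    print(f"{label}: {mean:.3f} +/- {half_width:.3f}")
```

With few runs per condition, a t-distribution or bootstrap interval would be wider and arguably more appropriate; the normal approximation is used here only to keep the sketch dependency-free.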