LLM4SCREENLIT: Recommendations on Assessing the Performance of Large Language Models for Screening Literature in Systematic Reviews

Lech Madeyski, Barbara Kitchenham, Martin Shepperd

TL;DR

This work addresses the evaluation challenges of using LLMs to screen literature for systematic reviews. By re-analyzing the DC+ study and 27 related papers, it shows that accuracy and other traditional metrics are unreliable on imbalanced SR data and that Lost Evidence (missed relevant studies) is a critical risk. It introduces Weighted MCC (WMCC) as a principled, cost-sensitive extension of MCC to account for asymmetric misclassification costs, and advocates for reporting complete confusion matrices, leakage-aware designs, and open artifacts to enable meta-analyses. The paper provides concrete recommendations for researchers, practitioners, and publishers to standardize evaluation practices, thereby improving the credibility and utility of LLM-supported SR screening across domains.
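The claim that accuracy is unreliable on imbalanced screening data can be illustrated with a small sketch. The counts below are hypothetical (1,000 records, 50 of them relevant) and are not taken from the paper; the functions implement plain accuracy and the standard, unweighted MCC. WMCC is not reproduced here because its definition is not given in this summary.

```python
import math

def accuracy(tp, fp, fn, tn):
    """Fraction of all records that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def mcc(tp, fp, fn, tn):
    """Standard (unweighted) Matthews correlation coefficient.

    Returns 0.0 when a margin of the confusion matrix is empty,
    which is a common convention for the undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom

# Hypothetical screener A: excludes every record, so all 50 relevant
# studies are missed.
exclude_all = dict(tp=0, fp=0, fn=50, tn=950)

# Hypothetical screener B: finds half of the relevant studies at the
# cost of 50 false positives.
partial = dict(tp=25, fp=50, fn=25, tn=900)

for name, cm in [("exclude_all", exclude_all), ("partial", partial)]:
    print(f"{name}: accuracy={accuracy(**cm):.3f}, mcc={mcc(**cm):.3f}")
```

Under these assumed counts, the screener that excludes everything scores the higher accuracy even though it finds no relevant study, while its MCC sits at the chance level of zero.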

Abstract

Context: Large language models (LLMs) are released faster than users can evaluate them rigorously. When LLMs underpin research, such as identifying relevant literature for systematic reviews (SRs), robust empirical assessment is essential. Objective: We identify and discuss key challenges in assessing LLM performance for selecting relevant literature, identify good (evaluation) practices, and propose recommendations. Method: Using a recent large-scale study as an example, we identify problems with the use of traditional metrics for assessing the performance of Gen-AI tools for identifying relevant literature in SRs. We analyzed 27 additional papers investigating this issue, extracted the performance metrics, and found both good practices and widespread problems, especially with the use and reporting of performance metrics for SR screening. Results: Major weaknesses included: i) reliance on metrics, such as Accuracy, that are not robust to imbalanced data and do not directly indicate whether results are better than chance, ii) a failure to consider the impact of lost evidence when making claims concerning workload savings, and iii) a pervasive failure to report the full confusion matrix (or performance metrics from which it can be reconstructed), which is essential for future meta-analyses. On the positive side, we extract good (evaluation) practices on which our recommendations for researchers and practitioners, as well as policymakers, are built. Conclusions: SR screening evaluations should prioritize lost evidence/recall alongside the chance-anchored, cost-sensitive Weighted MCC (WMCC) metric, report complete confusion matrices, treat unclassifiable outputs as referred-back positives for assessment, adopt leakage-aware designs with non-LLM baselines and open artifacts, and ground conclusions in cost-benefit analysis where FNs carry higher penalties than FPs.
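Two of the recommendations above, reporting the complete confusion matrix and treating unclassifiable outputs as referred-back positives, can be sketched as follows. The label strings ("include", "exclude"), the function names, and the sample data are all hypothetical, and the lost-evidence fraction is computed as FN / (TP + FN), which is an assumption rather than the paper's stated definition.

```python
def confusion_matrix(relevant_flags, predictions):
    """Build a full confusion matrix for a screening run.

    relevant_flags: iterable of bool (True = study is truly relevant).
    predictions: iterable of str; anything other than "exclude" is
    counted as a positive, so unclassifiable outputs are referred
    back to a human reviewer instead of being silently dropped.
    """
    cm = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for relevant, pred in zip(relevant_flags, predictions):
        predicted_positive = pred != "exclude"
        if relevant and predicted_positive:
            cm["tp"] += 1
        elif relevant:
            cm["fn"] += 1
        elif predicted_positive:
            cm["fp"] += 1
        else:
            cm["tn"] += 1
    return cm

def lost_evidence_fraction(cm):
    """Share of relevant studies that were missed (assumed: FN / (TP + FN))."""
    relevant_total = cm["tp"] + cm["fn"]
    return cm["fn"] / relevant_total if relevant_total else 0.0

# Hypothetical screening run with two unclassifiable outputs.
truth = [True, True, True, False, False, False, False, False]
preds = ["include", "unclear", "exclude", "exclude",
         "exclude", "include", "", "exclude"]

cm = confusion_matrix(truth, preds)
print(cm)
print(f"lost evidence fraction: {lost_evidence_fraction(cm):.3f}")
```

Reporting all four cells, rather than a single derived score, lets later meta-analyses recompute whichever metric they need.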

Paper Structure

This paper contains 17 sections, 15 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Lost Evidence per Model for three SRs (the median of Lost Evidence is shown as a point, and the min/max show the extremes)
  • Figure 2: Distribution of evaluation metrics used across 27 papers analyzing Gen-AI tools for systematic review screening
  • Figure 3: Adoption of good practices
  • Figure 4: Subsampling stability with 100 to 500 observations