Caught in the Web of Words: Do LLMs Fall for Spin in Medical Literature?

Hye Sun Yun, Karen Y. C. Zhang, Ramez Kouzy, Iain J. Marshall, Junyi Jessy Li, Byron C. Wallace

TL;DR

This study investigates whether Large Language Models (LLMs) are susceptible to spin in medical abstracts, a phenomenon known to bias clinician interpretation. By evaluating 22 LLMs on spin detection, interpretation of spun versus unspun trial results, and automatic simplification to plain language, the authors reveal that LLMs more readily embrace spin than humans, and can propagate it into downstream outputs. They show that targeted prompting strategies—especially joint spin detection and interpretation—significantly reduce this bias, offering practical mitigation for evidence synthesis tasks. The work highlights the need for careful prompt design and caution when deploying LLMs to summarize or simplify medical literature, particularly in oncology, where spin is prevalent and impactful.

Abstract

Medical research faces well-documented challenges in translating novel treatments into clinical practice. Publishing incentives encourage researchers to present "positive" findings, even when empirical results are equivocal. Consequently, it is well-documented that authors often spin study results, especially in article abstracts. Such spin can influence clinician interpretation of evidence and may affect patient care decisions. In this study, we ask whether the interpretation of trial results offered by Large Language Models (LLMs) is similarly affected by spin. This is important since LLMs are increasingly being used to trawl through and synthesize published medical evidence. We evaluated 22 LLMs and found that they are across the board more susceptible to spin than humans. They might also propagate spin into their outputs: We find evidence, e.g., that LLMs implicitly incorporate spin into plain language summaries that they generate. We also find, however, that LLMs are generally capable of recognizing spin, and can be prompted in a way to mitigate spin's impact on LLM outputs.

Paper Structure

This paper contains 31 sections, 1 equation, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Authors of medical articles sometimes spin their reporting of trial results. We find that LLMs are susceptible to this when "reading" medical abstracts, more so than human experts.
  • Figure 2: Spin detection task accuracies for all LLMs. The average accuracy of all models was 0.67 (solid red vertical line), well above the random baseline (gray dashed vertical line). That said, this plot shows considerable variance across models with respect to their spin detection capabilities.
  • Figure 3: Average mean differences of scores from LLMs for all 5 interpretation questions compared to human experts. Error bars indicate 95% confidence intervals. A positive mean difference indicates that LLMs interpreted the spun abstract as showing more favorable treatment results, while a negative mean difference indicates that they interpreted the unspun abstract as more favorable. This plot suggests that LLMs, in general, erroneously infer larger differences in results between spun and unspun abstracts than do human experts. Figure 4 shows the effect of spin for each LLM.
  • Figure 4: Coefficients from linear regression models, with 95% CIs, for each LLM, showing how much each LLM overestimates the treatment effect (benefit of treatment) when abstracts contain 'spin'. Compared with human experts (0.71), all LLMs were more susceptible to spin. AlpaCare 7B and Olmo2 Instruct 13B were the most susceptible.
  • Figure 5: Average mean differences of scores from Claude 3.5 Sonnet, GPT-4o Mini, and OpenBioLLM 70B interpreting simplified versions of abstracts with and without spin generated by 22 LLMs. The error bars indicate 95% confidence intervals. This plot shows that simplified spun abstracts generated by LLMs also exhibit spin.
  • ...and 7 more figures
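
The mean differences with 95% confidence intervals described in the captions for Figures 3 and 5 can be illustrated with a minimal sketch. This is not the authors' analysis code: the function name `mean_difference_ci`, the paired design, the normal-approximation interval (z = 1.96), and the example scores are all assumptions made for illustration only.

```python
from math import sqrt
from statistics import mean, stdev


def mean_difference_ci(spun, unspun, z=1.96):
    """Paired mean difference (spun - unspun) with a normal-approximation CI.

    Hypothetical helper: assumes each spun score is paired with the score
    given to the unspun version of the same abstract.
    """
    if len(spun) != len(unspun) or len(spun) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [s - u for s, u in zip(spun, unspun)]
    d = mean(diffs)
    # Standard error of the mean of the paired differences.
    se = stdev(diffs) / sqrt(len(diffs))
    return d, d - z * se, d + z * se


# Made-up scores for five abstract pairs (not data from the paper).
spun_scores = [7, 8, 6, 9, 7]
unspun_scores = [4, 5, 5, 6, 4]

diff, lo, hi = mean_difference_ci(spun_scores, unspun_scores)
print(f"mean difference = {diff:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Under these assumptions, a positive difference whose interval excludes zero would correspond to a rater (human or LLM) scoring the spun version of an abstract as more favorable than the unspun one.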