Mitigating Hallucinations in Zero-Shot Scientific Summarisation: A Pilot Study

Imane Jaaouine, Ross D. King

TL;DR

This pilot study addresses hallucinations in zero-shot scientific summarisation by testing seven prompt methods across six instruction-tuned LLMs on eight yeast biotechnology abstracts. It evaluates lexical and semantic alignment with reference abstracts using ROUGE, BERTScore, METEOR, and cosine similarity, with statistical inference via BCa bootstrap confidence intervals and Wilcoxon signed-rank tests. The results show that context repetition and random addition substantially improve lexical alignment, while increased instruction complexity does not reliably enhance semantic quality and may even hurt it. The findings suggest that prompt engineering offers a practical approach to mitigating context inconsistency hallucinations in zero-shot scientific summarisation, and they point to future work on optimising the sentence-repetition strategy and semantic relevance.

Abstract

Large language models (LLMs) produce context inconsistency hallucinations, which are LLM-generated outputs that are misaligned with the user prompt. This research project investigates whether prompt engineering (PE) methods can mitigate context inconsistency hallucinations in zero-shot LLM summarisation of scientific texts, where zero-shot indicates that the LLM relies purely on its pre-training data. Across eight yeast biotechnology research paper abstracts, six instruction-tuned LLMs were prompted with seven methods: a baseline prompt, two levels of increasing instruction complexity (PE-1 and PE-2), two levels of context repetition (CR-K1 and CR-K2), and two levels of random addition (RA-K1 and RA-K2). Context repetition involved the identification and repetition of K key sentences from the abstract, whereas random addition involved the repetition of K randomly selected sentences from the abstract, where K is 1 or 2. A total of 336 LLM-generated summaries were evaluated using six metrics: ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, METEOR, and cosine similarity, which were used to compute the lexical and semantic alignment between the summaries and the abstracts. Four hypotheses on the effects of prompt methods on summary alignment with the reference text were tested. Statistical analysis on 3744 collected datapoints was performed using bias-corrected and accelerated (BCa) bootstrap confidence intervals and Wilcoxon signed-rank tests with Bonferroni-Holm correction. The results demonstrated that CR and RA significantly improve the lexical alignment of LLM-generated summaries with the abstracts. These findings indicate that prompt engineering has the potential to mitigate context inconsistency hallucinations in zero-shot scientific summarisation tasks.
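To make the difference between context repetition (CR) and random addition (RA) concrete, the following is a minimal Python sketch, not the authors' implementation. The prompt wording, the placement of the repeated sentences after the abstract, the naive sentence splitter, and the names `split_sentences`, `build_prompt`, and `CONDITIONS` are all assumptions made for illustration. The abstract does not say how key sentences are identified, so the sketch takes them as an input. PE-1 and PE-2 are not sketched because their instruction text is not given here.

```python
import random
import re
from itertools import product


def split_sentences(text):
    """Naive splitter on ., ! or ? followed by whitespace (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def build_prompt(abstract, method="baseline", k=1, key_sentences=None, seed=0):
    """Build a hypothetical summarisation prompt for one prompt method.

    method: "baseline", "CR" (context repetition) or "RA" (random addition).
    k: number of sentences to repeat (1 or 2 in the study).
    key_sentences: pre-identified key sentences, required for "CR".
    """
    # Hypothetical instruction text; the study's actual wording is not shown here.
    base = "Summarise the following abstract.\n\n" + abstract
    if method == "baseline":
        return base
    if method == "CR":
        if not key_sentences or len(key_sentences) < k:
            raise ValueError("CR needs at least k key sentences")
        repeated = list(key_sentences[:k])
    elif method == "RA":
        # Seeded sampling without replacement keeps the example reproducible.
        repeated = random.Random(seed).sample(split_sentences(abstract), k)
    else:
        raise ValueError("unknown method: " + method)
    # Assumed placement: repeated sentences are appended after the abstract.
    return base + "\n\n" + " ".join(repeated)


# Experimental grid from the abstract: 8 abstracts x 6 LLMs x 7 prompt methods.
METHODS = ["baseline", "PE-1", "PE-2", "CR-K1", "CR-K2", "RA-K1", "RA-K2"]
CONDITIONS = list(product(range(8), range(6), METHODS))  # indices are placeholders
```

Under these assumptions, `len(CONDITIONS)` is 336, which matches the number of summaries reported. The only difference between CR and RA in the sketch is how the K repeated sentences are chosen.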

Paper Structure

This paper contains 51 sections, 21 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Visualisation of the experimental workflow. Each research paper abstract is inserted, and then the LLM and prompt method are selected. Each abstract is then summarised using seven prompt methods.
  • Figure 2: Median paired differences, $\Delta_{xyz}^{(m)}$, between each prompt method and the baseline across six evaluation metrics. Each cell is shaded based on the median $\Delta_{xyz}^{(m)}$ score for each combination of prompt method, evaluation metric, and reference text. Positive magnitudes are visualised in green, indicating performance improvement, and negative magnitudes are presented in purple, indicating performance decline. Asterisks identify the combinations with statistically significant performance differences, according to the BC$_a$ bootstrap confidence interval test and Bonferroni-Holm corrected Wilcoxon signed-rank test, where * = $p < 0.05$, ** = $p < 0.01$, *** = $p < 0.001$.
  • Figure 3: Performance comparison of prompt engineering methods relative to the baseline across six evaluation metrics, evaluated against the abstract text and key sentence references. The first row of plots visualises the impact of the first and second levels of prompt engineering methods on the alignment of the summary with the abstract text. The following four rows of plots show the comparison between the baseline, context repetition, and random addition across six evaluation metrics against the abstract text and $K$ key sentences for $K\!\in\!\{1,2\}$. Asterisks identify combinations that were significant under both the Bonferroni-Holm corrected Wilcoxon signed-rank test and the BC$_a$ bootstrap confidence interval test.
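
The figure captions rest on three computations: a median paired difference against the baseline, a Bonferroni-Holm adjustment of the Wilcoxon p-values, and a mapping from adjusted p-values to asterisks. The following pure-Python sketch illustrates these steps under stated assumptions and is not the authors' code. The function names are hypothetical, the raw p-values are taken as given inputs, and the BC$_a$ bootstrap check that the captions also require is not reproduced.

```python
from statistics import median


def median_paired_difference(method_scores, baseline_scores):
    """Median of per-item (method - baseline) differences for one metric."""
    if len(method_scores) != len(baseline_scores):
        raise ValueError("paired scores must have equal length")
    return median(m - b for m, b in zip(method_scores, baseline_scores))


def holm_adjust(p_values):
    """Bonferroni-Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def significance_stars(p):
    """Asterisk convention from the Figure 2 caption."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""
```

A positive median paired difference corresponds to the green cells in Figure 2 and a negative one to the purple cells. In the study a combination is marked only when the BC$_a$ confidence interval test agrees with the corrected Wilcoxon test, so this sketch covers one half of that decision.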