Mitigating Hallucinations in Zero-Shot Scientific Summarisation: A Pilot Study
Imane Jaaouine, Ross D. King
TL;DR
This pilot study addresses hallucinations in zero-shot scientific summarisation by testing seven prompting methods across six instruction-tuned LLMs on eight yeast biotechnology abstracts. It evaluates the lexical and semantic alignment of the generated summaries with the reference abstracts using ROUGE, BERTScore, METEOR, and cosine similarity, with statistical inference via BCa bootstrap confidence intervals and Wilcoxon signed-rank tests. The results show that context repetition and random addition substantially improve lexical alignment, while increased instruction complexity does not reliably enhance semantic quality and may even hurt it. The findings suggest that prompt engineering is a practical way to mitigate context inconsistency in zero-shot scientific summarisation, and they point to future work on optimising the sentence-repetition strategy and semantic relevance.
Abstract
Large language models (LLMs) produce context inconsistency hallucinations: LLM-generated outputs that are misaligned with the user prompt. This research project investigates whether prompt engineering (PE) methods can mitigate context inconsistency hallucinations in zero-shot LLM summarisation of scientific texts, where zero-shot indicates that the LLM relies purely on its pre-training data. Across eight yeast biotechnology research paper abstracts, six instruction-tuned LLMs were prompted with seven methods: a baseline prompt, two levels of increasing instruction complexity (PE-1 and PE-2), two levels of context repetition (CR-K1 and CR-K2), and two levels of random addition (RA-K1 and RA-K2). Context repetition identified and repeated K key sentences from the abstract, whereas random addition repeated K randomly selected sentences from the abstract, with K = 1 or 2. A total of 336 LLM-generated summaries were evaluated using six metrics (ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, METEOR, and cosine similarity), which measure the lexical and semantic alignment between the summaries and the abstracts. Four hypotheses on the effects of the prompt methods on summary alignment with the reference text were tested. Statistical analysis of 3744 collected datapoints was performed using bias-corrected and accelerated (BCa) bootstrap confidence intervals and Wilcoxon signed-rank tests with Bonferroni-Holm correction. The results demonstrated that CR and RA significantly improve the lexical alignment of LLM-generated summaries with the abstracts. These findings indicate that prompt engineering has the potential to affect hallucinations in zero-shot scientific summarisation tasks.
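To make the experimental design concrete, the following is a minimal Python sketch of the random addition (RA) condition and of the full prompting grid. It is illustrative only and is not the authors' implementation: the function names, the naive sentence splitter, the instruction wording, and the choice to append the repeated sentences after the abstract are all assumptions, since the abstract does not specify them. The key-sentence selection used by context repetition (CR) is not described here, so it is not sketched.

```python
import itertools
import random
import re

# Method labels as named in the abstract.
METHODS = ["baseline", "PE-1", "PE-2", "CR-K1", "CR-K2", "RA-K1", "RA-K2"]
N_ABSTRACTS = 8  # yeast biotechnology abstracts
N_MODELS = 6     # instruction-tuned LLMs


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def random_addition_prompt(abstract, k, instruction="Summarise the following abstract.", seed=None):
    """Build an RA-K prompt: the abstract plus K randomly chosen sentences repeated.

    The layout (instruction, abstract, repeated sentences) is a hypothetical choice.
    Returns the prompt and the list of repeated sentences.
    """
    sentences = split_sentences(abstract)
    if k > len(sentences):
        raise ValueError("k exceeds the number of sentences in the abstract")
    rng = random.Random(seed)  # seeded for reproducibility
    repeated = rng.sample(sentences, k)  # K distinct sentences, chosen at random
    prompt = f"{instruction}\n\n{abstract}\n\n" + " ".join(repeated)
    return prompt, repeated


def design_grid():
    """Enumerate every (abstract index, model index, method) combination."""
    return list(itertools.product(range(N_ABSTRACTS), range(N_MODELS), METHODS))


if __name__ == "__main__":
    demo = "Yeast is a model organism. It is used in biotechnology. Strains can be engineered."
    prompt, repeated = random_addition_prompt(demo, k=2, seed=0)
    print(prompt)
    print(len(design_grid()))  # 8 abstracts x 6 models x 7 methods
```

The grid size reproduces the 336 summaries reported above (8 abstracts, 6 models, 7 methods). Sampling without replacement is one plausible reading of "K randomly selected sentences"; sampling with replacement would be an equally simple variant.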
