How to Detect and Defeat Molecular Mirage: A Metric-Driven Benchmark for Hallucination in LLM-based Molecular Comprehension

Hao Li, Liuzhenghao Lv, He Cao, Zijing Liu, Zhiyuan Yan, Yu Wang, Yonghong Tian, Yu Li, Li Yuan

TL;DR

The paper tackles hallucination in LLM-based molecular understanding by introducing Mol-Hallu, a free-form evaluation metric that quantifies entity-level entailment between generated text, ground truth, and molecular descriptions. It identifies bio-knowledge shortcuts in PubChemQA as a key hallucination source and pairs Mol-Hallu with Hallucination Reduction Post-processing (HRPP) to mitigate these errors via entity-masking and Direct Preference Optimization. Empirical results show Mol-Hallu better captures semantic biotechnology errors than traditional metrics and that HRPP consistently reduces hallucinations across decoder-only and encoder-decoder models, improving reliability for molecular design and analysis. The work provides practical, scalable tools to assess and reduce hallucination in scientific LLMs, with implications for drug discovery and cheminformatics pipelines.

Abstract

Large language models are increasingly used in scientific domains, especially for molecular understanding and analysis. However, existing models are affected by hallucination issues, resulting in errors in drug design and utilization. In this paper, we first analyze the sources of hallucination in LLMs for molecular comprehension tasks, specifically the knowledge shortcut phenomenon observed in the PubChem dataset. To evaluate hallucination in molecular comprehension tasks with computational efficiency, we introduce Mol-Hallu, a novel free-form evaluation metric that quantifies the degree of hallucination based on the scientific entailment relationship between generated text and actual molecular properties. Utilizing the Mol-Hallu metric, we reassess and analyze the extent of hallucination in various LLMs performing molecular comprehension tasks. Furthermore, the Hallucination Reduction Post-processing (HRPP) stage is proposed to alleviate molecular hallucinations. Experiments show the effectiveness of HRPP on decoder-only and encoder-decoder molecular LLMs. Our findings provide critical insights into mitigating hallucination and improving the reliability of LLMs in scientific applications.

Paper Structure

This paper contains 24 sections, 10 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: (1) The top panel shows the scoring curves of Mol-Hallu vs. traditional metrics (BLEU, ROUGE, METEOR) across varying degrees of hallucination. $H:n$ indicates that samples contain $n$ counterfactual errors. Mol-Hallu imposes an exponential penalty on hallucination errors in text, whereas traditional metrics fail to evaluate biochemical hallucination reasonably. (2) The bottom panel presents a biochemical sample that suffers from severe hallucination (counterfactual entities are shown in red). Mol-Hallu reflects the degree of hallucination in scientific text more precisely than traditional metrics.
  • Figure 2: Experiments show that, for both decoder-only and encoder-decoder LLMs, the molecule-masking attack has little impact, while the drug-masking and distracting attacks lead to a substantial performance decrease. This indicates that the knowledge shortcut leads LLMs to align molecular properties with drug names rather than with molecular structures, thereby deviating from the goal of molecular comprehension.
  • Figure 3: The pipeline for building the entity preference dataset and our hallucination-reduction post-processing (HRPP) stage. The entity preference dataset is generated by removing bio-knowledge shortcuts and replacing entities with hallucinations. We then use the entity preference dataset to alleviate scientific-entity hallucination during the HRPP stage.
  • Figure 4: Hallucination distribution comparison. We visualize the distributions of the number of hallucinated entities for molecular LLMs (MolT5, Llama-3.1) and their de-hallucinated versions. HRPP effectively reduces how often hallucinations occur, shifting the distribution peak closer to 0.
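
The Figure 1 caption describes a score that decays exponentially with the number of counterfactual entities. The toy sketch below illustrates that idea only; it is not the paper's Mol-Hallu definition. The function names, the fixed entity vocabulary, the whole-word matching, and the `decay` parameter are all assumptions made for illustration.

```python
import math
import re


def extract_entities(text, vocabulary):
    """Return the vocabulary entries found in `text` (case-insensitive, whole-word match).

    Hypothetical helper: a real system would use a curated scientific entity database.
    """
    lowered = text.lower()
    return {
        entity
        for entity in vocabulary
        if re.search(r"\b" + re.escape(entity.lower()) + r"\b", lowered)
    }


def toy_hallucination_score(generated, reference, description, vocabulary, decay=1.0):
    """Toy score in (0, 1]: 1.0 means no unsupported entities were found.

    An entity in the generated text counts as unsupported when it appears in
    neither the ground-truth answer nor the molecular description.
    The exponential form and `decay` are illustrative assumptions.
    """
    generated_entities = extract_entities(generated, vocabulary)
    supported = extract_entities(reference, vocabulary) | extract_entities(
        description, vocabulary
    )
    unsupported = generated_entities - supported
    return math.exp(-decay * len(unsupported)), unsupported


# Made-up example inputs, not taken from any dataset.
vocabulary = {"antibiotic", "kinase inhibitor", "carboxylic acid", "alkaloid"}
reference = "The molecule is an antibiotic containing a carboxylic acid."
description = "It is a carboxylic acid."

faithful_score, faithful_unsupported = toy_hallucination_score(
    "The molecule is an antibiotic.", reference, description, vocabulary
)
hallucinated_score, hallucinated_unsupported = toy_hallucination_score(
    "The molecule is an alkaloid and a kinase inhibitor.",
    reference,
    description,
    vocabulary,
)
print(faithful_score, sorted(faithful_unsupported))
print(hallucinated_score, sorted(hallucinated_unsupported))
```

Under these assumptions, each additional unsupported entity multiplies the score by a constant factor, so a text with several counterfactual entities is penalized far more heavily than an n-gram overlap metric would penalize it.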