Question Rephrasing for Quantifying Uncertainty in Large Language Models: Applications in Molecular Chemistry Tasks

Zizhang Chen, Pengyu Hong, Sandeep Madireddy

TL;DR

This work addresses the problem of assessing reliability in large language models applied to chemistry by combining Question Rephrasing to probe input uncertainty with sampling-based evaluation of output uncertainty. It formalizes two uncertainty channels, using SMILES variant perturbations and entropy-oriented metrics, including an explicit structure-based clustering approach for generation tasks. Experiments with GPT-4 and GPT-3.5 across molecular property and forward reaction prediction demonstrate that input variations can affect predictions and that entropy-based scores effectively indicate when model outputs are trustworthy, even when raw accuracy is low. The findings underscore the need to enhance foundational chemistry understanding in LLMs to enable more reliable and transparent AI for chemical informatics.

Abstract

Uncertainty quantification enables users to assess the reliability of responses generated by large language models (LLMs). We present a novel Question Rephrasing technique to evaluate the input uncertainty of LLMs, which refers to the uncertainty arising from equivalent variations of the inputs provided to LLMs. This technique is integrated with sampling methods that measure the output uncertainty of LLMs, thereby offering a more comprehensive uncertainty assessment. We validated our approach on property prediction and reaction prediction for molecular chemistry tasks.
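The split between input and output uncertainty described above can be illustrated with a small entropy calculation. This is a minimal sketch under stated assumptions, not necessarily the paper's exact metric: it assumes answers are discrete labels, that each equivalent input (for example a SMILES variant) has been sampled several times, and that input uncertainty is read as the gap between the entropy of the pooled answers and the sample-weighted mean of the per-variant entropies. The function names and the example answers are hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical distribution of answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def uncertainty_report(samples_by_variant):
    """Split answer entropy into an output part and an input part.

    samples_by_variant maps each equivalent input (hypothetically, one SMILES
    variant of the same molecule) to the list of answers sampled for it.
    """
    pooled = [a for answers in samples_by_variant.values() for a in answers]
    if not pooled:
        raise ValueError("no sampled answers were provided")
    # Output uncertainty: spread of answers for a fixed input,
    # averaged over variants and weighted by how often each was sampled.
    output_entropy = sum(
        (len(answers) / len(pooled)) * answer_entropy(answers)
        for answers in samples_by_variant.values()
        if answers
    )
    # Total uncertainty: spread of answers once all variants are pooled.
    total_entropy = answer_entropy(pooled)
    # Input uncertainty (assumed definition): the part of the total spread
    # that is explained by which variant was asked.
    return {
        "output": output_entropy,
        "total": total_entropy,
        "input": total_entropy - output_entropy,
    }


# Made-up answers for three SMILES strings of aspirin; not data from the paper.
example = {
    "CC(=O)Oc1ccccc1C(=O)O": ["Yes", "Yes", "Yes"],
    "O=C(C)Oc1ccccc1C(=O)O": ["Yes", "No", "Yes"],
    "OC(=O)c1ccccc1OC(C)=O": ["No", "No", "No"],
}
report = uncertainty_report(example)
print(report)
```

In this made-up case two of the three variants give perfectly consistent answers, so the output term is small, while the pooled answers disagree strongly, so most of the total entropy is attributed to the input variation.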

Paper Structure

This paper contains 13 sections, 4 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: SMILES representation variants of Aspirin. While all structures depict the same molecule, their SMILES representations are different, which introduces input variations. Top left: Canonical SMILES representation of Aspirin. Rest: Five SMILES variations of Aspirin.
  • Figure 2: ROC curve evaluating how well our uncertainty score predicts the correctness of GPT responses.
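
A ROC curve such as the one in Figure 2 is commonly summarized by its area under the curve (AUROC). The sketch below shows one way to compute it, assuming each response has an uncertainty score and a correctness label, and that lower uncertainty should indicate a correct answer. The function name and the numbers are hypothetical and are not taken from the paper.

```python
def auroc(confidences, correct):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen correct answer has a higher
    confidence than a randomly chosen incorrect one (ties count as 0.5).
    """
    pos = [s for s, c in zip(confidences, correct) if c]
    neg = [s for s, c in zip(confidences, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up uncertainty scores and correctness labels, for illustration only.
uncertainties = [0.00, 0.20, 0.69, 0.64, 0.10]
is_correct = [True, True, False, True, False]

# Low uncertainty should mean "likely correct", so negate it to get a confidence.
score = auroc([-u for u in uncertainties], is_correct)
print(score)
```

A value near 1.0 would mean the uncertainty score separates correct from incorrect answers well, while a value near 0.5 would mean it is no better than chance.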