Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators

Prasoon Bajpai, Niladri Chatterjee, Subhabrata Dutta, Tanmoy Chakraborty

TL;DR

This work introduces a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts, along with a benchmarking suite that evaluates LLMs for correctness and consistency across various criteria, and finds that even the GPT models exhibit a general incompetence in reliably verifying LLM responses.

Abstract

Large Language Models (LLMs) and AI assistants driven by these models are experiencing exponential growth in usage among both expert and amateur users. In this work, we focus on evaluating the reliability of current LLMs as science communicators. Unlike existing benchmarks, our approach emphasizes assessing these models on scientific question-answering tasks that require a nuanced understanding and awareness of answerability. We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts, along with a benchmarking suite that evaluates LLMs for correctness and consistency across various criteria. We benchmark three proprietary LLMs from the OpenAI GPT family and 13 open-access LLMs from the Meta Llama-2, Llama-3, and Mistral families. While most open-access models significantly underperform compared to GPT-4 Turbo, our experiments identify Llama-3-70B as a strong competitor, often surpassing GPT-4 Turbo in various evaluation aspects. We also find that even the GPT models exhibit a general incompetence in reliably verifying LLM responses. Moreover, we observe an alarming trend where human evaluators are deceived by incorrect responses from GPT-4 Turbo.

Paper Structure

This paper contains 43 sections, 4 equations, 8 figures, and 7 tables.

Figures (8)

  • Figure 1: Examples of incorrect reasoning given by GPT-4 Turbo on problems in SCiPS-QA (Physics: air can cast a shadow under conditions of non-uniform refractive index; Chemistry: the complex is chiral with D3 symmetry; Mathematics: the paper discusses the model completeness of the real exponential field and its connection to Tarski's problem and the first root conjecture. Tarski's problem is an open problem).
  • Figure 2: Performance of GPT-4 Turbo on a random subset (of size 40) of MMLU-Pro, SciQ and SCiPS-QA. GPT-4 Turbo performs worst on SCiPS-QA across all subjects.
  • Figure 3: Verification of the reasoning passages generated by GPT-4 Turbo across convincingness (with and without answer), factuality, and information mismatch; we use both GPT-4 Turbo and GPT-3.5 Turbo as verifier models. The fraction of correct (incorrect) responses at each score level is shown in blue (red). An ideal verifier would assign every incorrect response the lowest score (1) and every correct response the highest score (5). However, no verifier model in our experiments could separate the correct responses from the incorrect ones.
  • Figure 4: Distribution of correct (in blue) and incorrect (in red) responses generated by GPT-4 Turbo against the convincingness scores provided by human evaluators. Incorrect LLM reasoning can strike humans as convincing whether or not the answer is shown to them. However, humans judge better when the answer is shown.
  • Figure 5: Topic decomposition for the subjects in SCiPS-QA: Physics (top-left), Chemistry (middle), and Mathematics (top-right).
  • ...and 3 more figures
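
The per-score distributions described in the Figure 3 and Figure 4 captions can be illustrated with a minimal sketch. It is not the authors' code: the function name `score_distribution`, the `(is_correct, score)` record layout, and the choice to normalize within each correctness group (rather than within each score level) are all assumptions made for illustration, and the sample records are invented.

```python
from collections import Counter


def score_distribution(records, levels=(1, 2, 3, 4, 5)):
    """Return, for correct and incorrect responses separately, the
    fraction of that group that received each verifier score.

    `records` is an iterable of (is_correct, score) pairs. Both the
    layout and the within-group normalization are assumptions.
    """
    counts = {True: Counter(), False: Counter()}
    for is_correct, score in records:
        counts[bool(is_correct)][score] += 1

    dist = {}
    for label in (True, False):
        total = sum(counts[label].values())
        dist[label] = {
            lvl: (counts[label][lvl] / total if total else 0.0)
            for lvl in levels
        }
    return dist


# Invented toy data: three correct and two incorrect responses.
toy = [(True, 5), (True, 5), (True, 3), (False, 1), (False, 5)]
dist = score_distribution(toy)
```

Under this reading, an ideal verifier would yield `dist[True][5] == 1.0` and `dist[False][1] == 1.0`; any mass that incorrect responses place on high scores corresponds to the failure mode the captions describe.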