Testing Uncertainty of Large Language Models for Physics Knowledge and Reasoning

Elizaveta Reganova, Peter Steinbach

TL;DR

This work introduces an analysis for evaluating the performance of popular open-source LLMs, as well as GPT-3.5 Turbo, on multiple-choice physics questionnaires, focusing on the relationship between answer accuracy and answer variability.

Abstract

Large Language Models (LLMs) have gained significant popularity in recent years for their ability to answer questions in various fields. However, these models tend to "hallucinate" their responses, which makes their performance hard to evaluate. A major challenge is determining how to assess the certainty of a model's predictions and how that certainty correlates with accuracy. In this work, we introduce an analysis for evaluating the performance of popular open-source LLMs, as well as GPT-3.5 Turbo, on multiple-choice physics questionnaires. We focus on the relationship between answer accuracy and answer variability in topics related to physics. Our findings suggest that most models give accurate replies when they are certain, but this behavior is far from universal. The relationship between accuracy and uncertainty exhibits a broad, horizontal, bell-shaped distribution. We report how the asymmetry between accuracy and uncertainty intensifies as the questions demand more logical reasoning from the LLM agent, while the same relationship remains sharp for knowledge retrieval tasks.
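As a rough illustration of the two quantities the abstract relates, the sketch below computes an entropy and an error rate (1 - accuracy) from repeated answers to a single multiple-choice question. It is a minimal sketch under stated assumptions, not the authors' pipeline: the function names, the sample answers, and the choice of base-2 logarithms are all hypothetical.

```python
import math
from collections import Counter


def answer_entropy(answers):
    """Shannon entropy in bits of the empirical answer distribution.

    Assumption: base-2 logarithm; the paper's convention may differ.
    """
    n = len(answers)
    counts = Counter(answers)
    # p * log2(1/p) summed over every distinct answer that was observed
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def error_rate(answers, correct):
    """Fraction of sampled answers that differ from the correct option."""
    return sum(a != correct for a in answers) / len(answers)


# Hypothetical example: eight sampled replies to one question whose key is "B".
samples = ["B", "B", "A", "B", "C", "B", "B", "A"]
entropy_bits = answer_entropy(samples)
err = error_rate(samples, "B")
print(f"entropy = {entropy_bits:.3f} bits, error rate = {err:.3f}")
```

A model that always returns the same option has zero entropy regardless of whether that option is correct, which is why entropy and error rate need to be examined together.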

Paper Structure

This paper contains 18 sections, 13 equations, 5 figures.

Figures (5)

  • Figure 1: Entropy obtained from the distribution of answers to single questions of the mlphys101 dataset for all four models.
  • Figure 2: Two-dimensional histogram of error rate (1 - Accuracy) vs. entropy across models. The binning of entropy is identical to Figure \ref{hist}.
  • Figure 3: Accuracy-certainty trade-off for each LLM in five question categories.
  • Figure 4: Two-dimensional histogram of error rate (1 - Accuracy) vs. entropy across models with counts per bin. Entries are identical to Figure \ref{curve}.
  • Figure 5: A. Two-dimensional histogram of (1 - Accuracy) vs. entropy for the Mistral 7B model, shown alongside the theoretical curve (red) representing the scenario where the model provides only two distinct responses, one of which is correct (see Equation \ref{eq1}). B. Theoretical curves for binary responses (red) and three distinct responses (blue) with $p_{i1}$ = {0.1, 0.3, 0.5, 0.7, 0.9} (see Equation \ref{eq2}). Curve intensity increases as $p_{i1}$ increases. C. Theoretical curves (Equation \ref{eq3}) for three (blue) and four (red) distinct responses, with $p_{i1}$ = 0.3 and $p_{i2}$ = {0.05, 0.1, 0.3, 0.5, 0.65}. Curve intensity increases as $p_{i2}$ increases. D. Theoretical curves based on the general Equation \ref{eq:4} with parameters $p_{i1}$, $p_{i2}$, and $p_{i3}$ varying within {0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}.
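
The two-response scenario described for Figure 5A can be sketched numerically. If a model only ever gives two distinct responses and exactly one of them is correct, the error rate fixes the whole answer distribution, so entropy becomes a function of the error rate alone. The code below uses the standard binary Shannon entropy for this; the equations referenced in the captions are not reproduced on this page, so the base-2 logarithm and the function name are assumptions rather than the authors' exact formulation.

```python
import math


def binary_entropy(error_rate):
    """Entropy in bits when only two responses occur and one is correct.

    error_rate is the probability of the wrong response (1 - accuracy).
    Assumption: base-2 logarithm; the paper's equation may use another base.
    """
    p = error_rate
    if p <= 0.0 or p >= 1.0:
        # A single response carries no uncertainty.
        return 0.0
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))


# Trace the curve over a grid of error rates.
for e in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"error rate = {e:.1f} -> entropy = {binary_entropy(e):.3f} bits")
```

Under these assumptions the curve is symmetric about an error rate of 0.5, where entropy peaks, so a low entropy value alone does not distinguish a model that is consistently right from one that is consistently wrong.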