SciFaultyQA: Benchmarking LLMs on Faulty Science Question Detection with a GAN-Inspired Approach to Synthetic Dataset Generation

Debarshi Kundu

TL;DR

SciFaultyQA addresses the problem of evaluating whether LLMs recognize flawed science questions rather than simply answering them. The authors propose a GAN-inspired synthetic data-generation pipeline, with multiple generator LLMs producing faulty questions and a discriminator LLM assessing faults, iterating on feedback to create fault-injected datasets from SciQA and SciQ. They demonstrate varying fault-detection performance across three LLMs and show substantial gains when employing multi-agent verification and web-search augmentation, up to $65\%$ accuracy. This work provides a scalable benchmark framework and practical strategies to improve LLM robustness in science QA, with future directions including diffusion-inspired fault injection and broader error-reduction techniques.
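The generator and discriminator loop summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Generator` and `Discriminator` call signatures, the round-robin choice of generator, the `max_rounds` cap, and the stopping rule (accept a candidate once the discriminator fails to flag it) are not taken from the paper, and the two toy functions stand in for real LLM calls.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not the paper's API):
#   a generator maps (original question, discriminator feedback) -> faulty candidate
#   a discriminator maps candidate -> (fault_detected, feedback text)
Generator = Callable[[str, str], str]
Discriminator = Callable[[str], Tuple[bool, str]]


def inject_fault(
    question: str,
    generators: List[Generator],
    discriminator: Discriminator,
    max_rounds: int = 3,
) -> Optional[str]:
    """Return a faulty variant the discriminator did not flag, or None."""
    if not generators:
        raise ValueError("at least one generator is required")
    feedback = ""
    for round_idx in range(max_rounds):
        # Assumed policy: rotate through the generators round-robin.
        generator = generators[round_idx % len(generators)]
        candidate = generator(question, feedback)
        detected, feedback = discriminator(candidate)
        if not detected:
            # Assumed stopping rule: keep the first candidate that slips through.
            return candidate
    return None


# Toy stand-ins for LLM calls, used only to make the sketch runnable.
def toy_generator(question: str, feedback: str) -> str:
    if not feedback:
        # First attempt: a blatant fault.
        return question + " (assume water boils at -500 C)"
    # After feedback: a subtler fault.
    return question + " (assume water boils at 90 C at sea level)"


def toy_discriminator(candidate: str) -> Tuple[bool, str]:
    if "-500" in candidate:
        return True, "a temperature below absolute zero is impossible"
    return False, ""


faulty = inject_fault(
    "At what temperature does water boil?", [toy_generator], toy_discriminator
)
print(faulty)
```

In this toy run the first candidate is flagged, the feedback is passed back to the generator, and the second, subtler candidate is accepted. With `max_rounds=1` the function returns `None` because no candidate escapes detection within the budget.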

Abstract

Consider the problem: "If one man and one woman can produce one child in one year, how many children will be produced by one woman and three men in 0.5 years?" Current large language models (LLMs) such as GPT-4o, GPT-o1-preview, and Gemini Flash frequently answer "0.5," which does not make sense. While these models sometimes acknowledge the unrealistic nature of the question, in many cases (8 out of 10 trials) they provide the nonsensical answer of "0.5 child." Temporal variation has also been observed: if an LLM answers correctly once (by recognizing the faulty nature of the question), subsequent responses are more likely to reflect the same understanding, although this effect is not consistent. Questions of this kind motivated us to develop SciFaultyQA, a dataset of science questions that are intentionally faulty. We observed that LLMs often proceed to answer these flawed questions without recognizing their inherent issues, producing results that are logically or scientifically invalid. By analyzing such patterns, we developed a novel method for generating synthetic datasets to evaluate and benchmark the performance of various LLMs in identifying these flawed questions. We have also developed novel approaches to reduce these errors.

Paper Structure

This paper contains 7 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: GAN-inspired synthetic data generation flow