SciTrust 2.0: A Comprehensive Framework for Evaluating Trustworthiness of Large Language Models in Scientific Applications

Emily Herron, Junqi Yin, Feiyi Wang

TL;DR

SciTrust 2.0 presents a holistic framework for evaluating the trustworthiness of large language models in scientific contexts across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. It introduces novel benchmarks, including a reflection-tuning-based open-ended truthfulness test and an eight-area scientific ethics benchmark, and compares seven LLMs spanning science-specialized and general-purpose models. Results show that general-purpose industry models generally outperform their science-specialized counterparts across dimensions, with GPT-o4-mini leading in truthfulness and robustness, while specialized models lag in ethical reasoning and safety. By open-sourcing the framework, the work provides a foundation for safer, more ethically aligned AI in scientific research and encourages further multi-dimensional studies of LLM trustworthiness.

Abstract

Large language models (LLMs) have demonstrated transformative potential in scientific research, yet their deployment in high-stakes contexts raises significant trustworthiness concerns. Here, we introduce SciTrust 2.0, a comprehensive framework for evaluating LLM trustworthiness in scientific applications across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. Our framework incorporates novel, open-ended truthfulness benchmarks developed through a verified reflection-tuning pipeline and expert validation, alongside a novel ethics benchmark for scientific research contexts covering eight subcategories including dual-use research and bias. We evaluated seven prominent LLMs, including four science-specialized models and three general-purpose industry models, using multiple evaluation metrics including accuracy, semantic similarity measures, and LLM-based scoring. General-purpose industry models overall outperformed science-specialized models across each trustworthiness dimension, with GPT-o4-mini demonstrating superior performance in truthfulness assessments and adversarial robustness. Science-specialized models showed significant deficiencies in logical and ethical reasoning capabilities, along with concerning vulnerabilities in safety evaluations, particularly in high-risk domains such as biosecurity and chemical weapons. By open-sourcing our framework, we provide a foundation for developing more trustworthy AI systems and advancing research on model safety and ethics in scientific contexts.

Paper Structure

This paper contains 31 sections, 13 figures, and 12 tables.

Figures (13)

  • Figure 1: Overview of the SciTrust 2.0 Framework. The framework evaluates LLM trustworthiness in scientific contexts across four dimensions: (1) Truthfulness (factual accuracy and hallucination resistance), assessed through scientific knowledge benchmarks, logical reasoning tasks, and hallucination detection; (2) Adversarial Robustness (stability under perturbations), evaluated through multiple-choice and open-ended adversarial tests; (3) Scientific Safety (prevention of harmful outputs), measured via biosecurity, cybersecurity, and chemical security benchmarks; and (4) Scientific Ethics (research integrity alignment), assessed using our novel ethics benchmark covering eight areas of scientific research ethics. The framework employs multiple evaluation metrics including lexical and semantic similarity measures, accuracy scores, and LLM-based qualitative assessment to compare performance between science-specialized models and general-purpose industry models.
  • Figure 2: Expert ratings of Q&A pairs generated from research articles and literature reviews across different stages of the reflection-tuning pipeline. Mean scores (scale 1-5) are shown for five quality dimensions: helpfulness, relevance, accuracy, level of detail, and contextual independence. Results demonstrate progressive improvement through the pipeline, with the most substantial gains observed in level of detail, helpfulness, and contextual independence after the response reflection phase.
  • Figure 3: Reflection-tuning pipeline architecture for generating high-quality scientific question-answer pairs. The process begins with scientific literature review corpus selection, followed by three sequential stages: (1) initial Q&A pair generation using an oracle model, (2) instruction reflection tuning to improve question quality and contextual independence, and (3) response reflection tuning to enhance answer accuracy and completeness. The full prompts used at each stage are provided in Appendix A.
  • Figure 4: Example question-answer pair from the Open-Ended Computer Science dataset generated using our reflection-tuning pipeline.
  • Figure 5: Performance changes in accuracy across multiple-choice scientific benchmarks under adversarial perturbations. Values represent percentage point changes from baseline accuracy when models are evaluated on adversarially modified versions of SciQ, GPQA-Diamond, and ARC-C datasets. Color intensity corresponds to magnitude of accuracy reduction, with darker colors indicating greater vulnerability to adversarial attacks.
  • ...and 8 more figures
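
The Figure 5 caption reports robustness as percentage point changes from baseline accuracy. A minimal sketch of that arithmetic follows, assuming exact-match scoring of multiple-choice answers. The function names and the toy data are hypothetical illustrations, not taken from the SciTrust 2.0 codebase.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers) or not answers:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def percentage_point_change(baseline_acc: float, perturbed_acc: float) -> float:
    """Change from baseline in percentage points (negative means accuracy dropped)."""
    return (perturbed_acc - baseline_acc) * 100.0


# Hypothetical toy data: four multiple-choice items, not real benchmark results.
answers = ["A", "C", "B", "D"]
baseline_preds = ["A", "C", "B", "A"]   # 3 of 4 correct
perturbed_preds = ["A", "B", "B", "A"]  # 2 of 4 correct after perturbation

delta = percentage_point_change(
    accuracy(baseline_preds, answers),
    accuracy(perturbed_preds, answers),
)
print(f"{delta:+.1f} pp")  # prints "-25.0 pp"
```

A percentage point change is an absolute difference between two accuracies, so a drop from 75% to 50% is reported as -25 points rather than as a relative decline of one third.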