SciTrust 2.0: A Comprehensive Framework for Evaluating Trustworthiness of Large Language Models in Scientific Applications

Emily Herron; Junqi Yin; Feiyi Wang

TL;DR

SciTrust 2.0 is a framework for evaluating the trustworthiness of large language models (LLMs) in scientific contexts across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. It introduces new benchmarks, including open-ended truthfulness benchmarks built with a reflection-tuning pipeline and a scientific ethics benchmark covering eight subcategories, and it compares seven LLMs spanning science-specialized and general-purpose industry models. Overall, the general-purpose industry models outperform their science-specialized counterparts in each dimension, with GPT-o4-mini leading in truthfulness and adversarial robustness, while the specialized models lag in ethical reasoning and safety. By open-sourcing the framework, the work provides a foundation for safer, more ethically aligned AI in scientific research and encourages further multi-dimensional studies of LLM trustworthiness.

Abstract

Large language models (LLMs) have demonstrated transformative potential in scientific research, yet their deployment in high-stakes contexts raises significant trustworthiness concerns. Here, we introduce SciTrust 2.0, a comprehensive framework for evaluating LLM trustworthiness in scientific applications across four dimensions: truthfulness, adversarial robustness, scientific safety, and scientific ethics. Our framework incorporates novel, open-ended truthfulness benchmarks developed through a verified reflection-tuning pipeline and expert validation, alongside a novel ethics benchmark for scientific research contexts covering eight subcategories including dual-use research and bias. We evaluated seven prominent LLMs, including four science-specialized models and three general-purpose industry models, using multiple evaluation metrics including accuracy, semantic similarity measures, and LLM-based scoring. General-purpose industry models overall outperformed science-specialized models across each trustworthiness dimension, with GPT-o4-mini demonstrating superior performance in truthfulness assessments and adversarial robustness. Science-specialized models showed significant deficiencies in logical and ethical reasoning capabilities, along with concerning vulnerabilities in safety evaluations, particularly in high-risk domains such as biosecurity and chemical weapons. By open-sourcing our framework, we provide a foundation for developing more trustworthy AI systems and advancing research on model safety and ethics in scientific contexts.
