Online Safety Analysis for LLMs: a Benchmark, an Assessment, and a Path Forward

Xuan Xie, Jiayang Song, Zhehua Zhou, Yuheng Huang, Da Song, Lei Ma

TL;DR

The paper addresses the need for real-time safety analysis of LLMs by introducing the first public benchmark that spans open- and closed-source models, diverse tasks, and multiple online safety analysis methods. It demonstrates, via a pilot study, that unsafe outputs can often be detected early in generation, and then systematically evaluates method performance across tasks using a five-metric framework. The findings reveal strengths and limitations of existing approaches, highlight the benefits and challenges of grey-box and entropy/likelihood-based methods, and show that simple hybridization can improve safety assessments in some settings. Together, these contributions offer a practical path toward trustworthy, LLM-specific online safety tooling and guide future research on more effective, scalable QA frameworks for real-time deployment.

Abstract

While Large Language Models (LLMs) have seen widespread applications across numerous fields, their limited interpretability poses concerns regarding their safe operations from multiple aspects, e.g., truthfulness, robustness, and fairness. Recent research has started developing quality assurance methods for LLMs, introducing techniques such as offline detector-based or uncertainty estimation methods. However, these approaches predominantly concentrate on post-generation analysis, leaving the online safety analysis for LLMs during the generation phase an unexplored area. To bridge this gap, we conduct in this work a comprehensive evaluation of the effectiveness of existing online safety analysis methods on LLMs. We begin with a pilot study that validates the feasibility of detecting unsafe outputs in the early generation process. Following this, we establish the first publicly available benchmark of online safety analysis for LLMs, including a broad spectrum of methods, models, tasks, datasets, and evaluation metrics. Utilizing this benchmark, we extensively analyze the performance of state-of-the-art online safety analysis methods on both open-source and closed-source LLMs. This analysis reveals the strengths and weaknesses of individual methods and offers valuable insights into selecting the most appropriate method based on specific application scenarios and task requirements. Furthermore, we also explore the potential of using hybridization methods, i.e., combining multiple methods to derive a collective safety conclusion, to enhance the efficacy of online safety analysis for LLMs. Our findings indicate a promising direction for the development of innovative and trustworthy quality assurance methodologies for LLMs, facilitating their reliable deployments across diverse domains.
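
The abstract mentions uncertainty-based analysis during generation and hybridization, i.e., combining several methods into one safety conclusion. The following is only a minimal sketch of that general idea, not the benchmark's implementation: the function names, the two entropy-based scorers, the equal-weight averaging, and the threshold are all illustrative assumptions. It consumes next-token probability distributions one step at a time and reports the first step at which the combined score crosses the threshold.

```python
import math
from typing import Callable, Iterable, List, Optional, Sequence

# Hypothetical type: a scorer maps the distributions seen so far to a risk score.
Scorer = Callable[[List[Sequence[float]]], float]


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(steps: List[Sequence[float]]) -> float:
    """Average token entropy over the tokens generated so far."""
    return sum(token_entropy(p) for p in steps) / len(steps)


def max_entropy(steps: List[Sequence[float]]) -> float:
    """Largest token entropy observed so far."""
    return max(token_entropy(p) for p in steps)


def hybrid_score(
    steps: List[Sequence[float]],
    scorers: Sequence[Scorer],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted combination of several scores (equal weights by default)."""
    if weights is None:
        weights = [1.0 / len(scorers)] * len(scorers)
    return sum(w * s(steps) for w, s in zip(weights, scorers))


def monitor(
    stream: Iterable[Sequence[float]],
    scorers: Sequence[Scorer],
    threshold: float,
) -> Optional[int]:
    """Return the index of the first step whose hybrid score exceeds
    the threshold, or None if generation finishes without an alarm."""
    seen: List[Sequence[float]] = []
    for step, probs in enumerate(stream):
        seen.append(probs)
        if hybrid_score(seen, scorers) > threshold:
            return step
    return None


if __name__ == "__main__":
    # Toy stream: the model grows less certain with each generated token.
    toy_stream = [[1.0, 0.0], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
    print(monitor(toy_stream, [mean_entropy, max_entropy], threshold=0.5))
```

In a real setting the distributions would come from the model's logits at each decoding step, which requires grey-box access; the threshold and weights would need to be calibrated per task rather than fixed by hand as in this toy example.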

Paper Structure

This paper contains 28 sections, 18 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Overall workflow illustration.
  • Figure 2: Decoder-only LLM illustration.
  • Figure 3: Pilot study results on TruthfulQA (in %).
  • Figure 4: Pilot study results on RealToxicityPrompt (in %).
  • Figure 5: Pilot study results on MBPP (in %).
  • ...and 2 more figures