FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity

Shiyao Cui, Zhenyu Zhang, Yilong Chen, Wenyuan Zhang, Tianyun Liu, Siqi Wang, Tingwen Liu

TL;DR

FFT tackles the problem of evaluating LLM harmlessness beyond toxicity by introducing a comprehensive benchmark with 2116 instances across factuality, fairness, and toxicity. It combines adversarial seeds, diverse scenarios, and jailbreak prompts to stress-test nine representative LLMs under zero-shot and limited-shot settings, revealing that harmlessness is not yet satisfactory and that training signals like SFT and RLHF can improve safety. The work provides detailed methodology for seed collection, template construction, and multi-faceted evaluation metrics, offering practical guidance for future safe-LLM development and benchmarking. Overall, FFT enables a more holistic assessment of LLM harms and highlights where progress is most needed for real-world deployment.

Abstract

The widespread adoption of generative artificial intelligence has heightened concerns about the potential harms posed by AI-generated texts, primarily stemming from non-factual, unfair, and toxic content. Previous researchers have invested much effort in assessing the harmlessness of generative language models. However, existing benchmarks are struggling in the era of large language models (LLMs), due to their stronger language generation and instruction-following capabilities, as well as their wider applications. In this paper, we propose FFT, a new benchmark with 2116 elaborately designed instances, for LLM harmlessness evaluation with factuality, fairness, and toxicity. To investigate the potential harms of LLMs, we evaluate 9 representative LLMs covering various parameter scales, training stages, and creators. Experiments show that the harmlessness of LLMs is still unsatisfactory, and extensive analysis yields insightful findings that could inspire future research on harmless LLMs.

Paper Structure (24 sections, 3 figures, 8 tables)

Figures (3)

  • Figure 1: Examples of three kinds of potential harms in LLM-generated content. LLMs should provide accurate, neutral, and moral responses.
  • Figure 2: Evaluation scheme with example queries. The queries for credit, criminal, and health assessment are abbreviated; see the Appendix for the complete examples.
  • Figure 3: Example prompts for a credit assessment query with "female" identity, a criminal assessment query with "female" identity, and a health assessment query with "male" identity (from top to bottom).