Testing and Evaluation of Large Language Models: Correctness, Non-Toxicity, and Fairness

Wenxuan Wang

TL;DR

This thesis presents exploratory work on language model reliability carried out during the author's PhD study, focusing on the correctness, non-toxicity, and fairness of LLMs from both software testing and natural language processing perspectives.

Abstract

Large language models (LLMs), such as ChatGPT, have rapidly entered people's work and daily lives over the past few years, owing to their extraordinary conversational skills and intelligence. ChatGPT has become the fastest-growing software in history in terms of user numbers and an important foundation model for the next generation of artificial intelligence applications. However, the outputs of LLMs are not entirely reliable: they often contain factual errors, biases, and toxicity. Given the vast number of users and the wide range of application scenarios, these unreliable responses can have serious negative impacts. This thesis presents exploratory work on language model reliability carried out during the PhD study, focusing on the correctness, non-toxicity, and fairness of LLMs from both software testing and natural language processing perspectives. First, to measure the correctness of LLMs, we introduce two testing frameworks, FactChecker and LogicAsker, which evaluate factual knowledge and logical reasoning accuracy, respectively. Second, for the non-toxicity of LLMs, we introduce two works on red-teaming LLMs. Third, to evaluate the fairness of LLMs, we introduce two evaluation frameworks, BiasAsker and XCulturalBench, which measure the social bias and cultural bias of LLMs, respectively.

Paper Structure (159 sections, 9 equations, 26 figures, 40 tables)

Introduction
Overview
Thesis Contributions
Publications During Ph.D. Study
Thesis Organization
Background Review
Large Language Models
Pre-Training Language Models
Large Language Models
Software Testing
Definition
Taxonomy
Limitation and Our Focus
LLMs Evaluation Benchmarks
Natural Language Processing Tasks
...and 144 more sections

Figures (26)

Figure 1: Example of unreliable generation from ChatGPT.
Figure 2: Overview of the research in this thesis. This figure visualizes the research outcomes during my PhD study. The foci of this thesis are highlighted in bold.
Figure 3: Overview of the background review as well as the landmarks of the research work in this thesis.
Figure 4: The architectures of Skip-gram and Continuous Bag of Words models.
Figure 5: The architecture of ELMo (Devlin et al., 2019).
...and 21 more figures
