Taxonomy-based CheckList for Large Language Model Evaluation

Damin Zhang

TL;DR

The paper addresses gender-bias evaluation in QA by introducing a taxonomy-based CheckList that uses human knowledge and attribute-level prompting to measure additive consistency via $f(q)=f(a)=c$ and $f(q+a)=c$. It constructs a dataset by linking 62 occupations from the O*NET-SOC 2019 taxonomy to three attribute categories (skill, knowledge, ability) and evaluates RoBERTa-large and GPT-3.5-turbo-instruct on it in zero-shot settings. The main contributions are a taxonomy-informed bias evaluation framework, a corresponding bias-annotated dataset for cross-model comparison, and empirical findings that bias is highly model-dependent: alignment partially mitigates it in the LLM, while it persists in the transformer-based QA model. The framework supports deeper bias discovery and benchmarking, and can guide future mitigation efforts and broader evaluations of open LLMs.
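The additive-consistency condition can be illustrated with a minimal sketch. It assumes one reading of the notation: $f$ is the model, $q$ a question, $a$ an attribute statement, and $c$ the answer the model picks, so that whenever $f(q)=f(a)=c$, the combined prompt should also yield $c$. The function name `additive_consistency`, the two toy models, and the way `q` and `a` are joined (attribute text prepended to the question) are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Optional


def additive_consistency(
    f: Callable[[str], str], q: str, a: str
) -> Optional[bool]:
    """Check f(q + a) == c, given that f(q) == f(a) == c.

    Returns None when the premise f(q) == f(a) does not hold,
    because the condition says nothing in that case.
    """
    c_q, c_a = f(q), f(a)
    if c_q != c_a:
        return None  # premise fails; the check does not apply
    # Assumption: "q + a" is modeled as the attribute followed by the question.
    return f(f"{a} {q}") == c_q


# Hypothetical stand-ins for a QA model; they only make the sketch runnable.
def keyword_model(prompt: str) -> str:
    """Answers 'Alice' whenever the prompt mentions 'detail', else 'Bob'."""
    return "Alice" if "detail" in prompt.lower() else "Bob"


def length_model(prompt: str) -> str:
    """Answers 'Alice' for short prompts (under 20 characters), else 'Bob'."""
    return "Alice" if len(prompt) < 20 else "Bob"
```

With the keyword-based toy model, a question and an attribute that both mention "detail" give the same answer alone and combined, so the check passes. The length-based toy model shows a failure: both short prompts give one answer, but their concatenation gives another.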

Abstract

As large language models (LLMs) are used in many downstream tasks, their internal stereotypical representations may affect the fairness of their outputs. In this work, we introduce human knowledge into natural language interventions and study the behavior of pre-trained language models (LMs) in the context of gender bias. Inspired by CheckList behavioral testing, we present a checklist-style task that aims to probe and quantify LMs' unethical behaviors through question-answering (QA). We design three comparison studies to evaluate LMs from four aspects: consistency, biased tendency, model preference, and gender preference switch. We probe one transformer-based QA model trained on the SQuAD-v2 dataset and one autoregressive large language model. Our results indicate that the transformer-based QA model's biased tendency positively correlates with its consistency, whereas the LLM shows the opposite relation. Our proposed task provides the first dataset that involves human knowledge for LLM bias evaluation.

Paper Structure (13 sections, 1 figure, 3 tables)

This paper contains 13 sections, 1 figure, 3 tables.

Introduction
Chain-of-Thought CheckList
Taxonomy-based Context
Dataset Construction
Experiments
Zero-shot Evaluation
Language Model's Logical Consistency
Model Preference
Model Bias
Gender Preference Switch
Results and Analysis
Related Work
Conclusion and Future Work

Figures (1)

Figure 1: Aggregated average scores across gendered names for different aspects: consistency, bias, model preference of female, model preference of male, female switch to male, male switch to female.
