Toward Automatic Group Membership Annotation for Group Fairness Evaluation

Fumian Chen; Dayu Yang; Hui Fang

TL;DR

To address the sparsity of group membership (GM) annotations needed for group fairness evaluation, the paper assesses automatic GM annotation using four language-model classifiers and measures how the resulting labels affect fairness evaluation. It finds that BERT-based sentence classification achieves the best accuracy with minimal supervision, while generative LLMs are less competitive for discriminative GM labeling and incur high costs. Replacing human GM annotations with BERT-based GM annotations yields strong agreement with human-based fairness evaluations at both the system and query levels across the TREC Fair Ranking tasks and NTCIR FairWeb-1. This work enables scalable, low-cost fairness evaluation and broadens the set of IR datasets available for fairness research, with code released for reproducibility.

Abstract

With increasing research attention on fairness in information retrieval systems, more and more fairness-aware algorithms have been proposed to ensure fairness for a sustainable and healthy retrieval ecosystem. However, group fairness evaluation metrics, the most widely adopted measures for fairness-aware algorithms, require group membership information, which demands massive human annotation and is rarely available for general information retrieval datasets. This data sparsity significantly impedes the development of fairness-aware information retrieval studies. Hence, a practical, scalable, low-cost group membership annotation method is needed to assist or replace human annotation. This study explores how to leverage language models to automatically annotate group membership for group fairness evaluation, focusing on annotation accuracy and its impact. Our experimental results show that BERT-based models outperform state-of-the-art large language models, including GPT and Mistral, achieving promising annotation accuracy with minimal supervision on recent fair-ranking datasets. Our impact-oriented evaluations reveal that a small amount of annotation error does not degrade the effectiveness and robustness of group fairness evaluation. The proposed annotation method greatly reduces human effort and extends fairness-aware studies to more datasets.
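To make the idea of system-level agreement concrete, the following is a minimal sketch, not the paper's implementation. It assumes a simple exposure-based fairness score and Kendall's tau as the agreement measure; the function names, the toy documents, the group labels, and the target distribution are all hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations


def group_exposure(ranking, gm):
    """Share of position-discounted exposure received by each group."""
    exposure = {}
    for rank, doc in enumerate(ranking, start=1):
        group = gm[doc]
        exposure[group] = exposure.get(group, 0.0) + 1.0 / math.log2(rank + 1)
    total = sum(exposure.values())
    return {g: e / total for g, e in exposure.items()}


def fairness_score(ranking, gm, target):
    """1 minus half the L1 distance between exposure shares and target shares."""
    shares = group_exposure(ranking, gm)
    groups = set(shares) | set(target)
    gap = sum(abs(shares.get(g, 0.0) - target.get(g, 0.0)) for g in groups)
    return 1.0 - 0.5 * gap


def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length lists of scores."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: human GM labels and automatic labels with one error (d5).
human_gm = {"d1": "A", "d2": "A", "d3": "B", "d4": "B", "d5": "B"}
auto_gm = {"d1": "A", "d2": "A", "d3": "B", "d4": "B", "d5": "A"}
target = {"A": 0.5, "B": 0.5}  # assumed target exposure distribution

systems = {
    "s1": ["d1", "d2", "d3", "d4", "d5"],
    "s2": ["d1", "d3", "d2", "d4", "d5"],
    "s3": ["d3", "d4", "d5", "d1", "d2"],
}

# Score every system twice: once with human labels, once with automatic labels.
human_scores = [fairness_score(r, human_gm, target) for r in systems.values()]
auto_scores = [fairness_score(r, auto_gm, target) for r in systems.values()]

# Agreement between the two system orderings (1.0 means identical ordering).
tau = kendall_tau(human_scores, auto_scores)
print(f"Kendall's tau between human-based and automatic evaluations: {tau:.3f}")
```

In this sketch, a tau close to 1 would indicate that the automatic labels rank systems in nearly the same order as the human labels do, which is the kind of system-level agreement the abstract refers to.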

Paper Structure (15 sections, 1 equation, 7 figures, 4 tables)

Introduction
Related Work
Automate GM Annotation for Fairness Evaluation
GM in Fairness Evaluations
Challenges with GM annotation
Annotating GM by Text Classification with Language Models
Evaluation and Analysis
Prediction Accuracy of GM Annotation Models
Group Fairness Evaluation with GM annotations
System-level Evaluation
Query-level Robustness
Impact of the Annotation Accuracy
Generalizability of System Evaluation
Conclusion
Acknowledgments

Figures (7)

Figure 1: The necessity of GM annotation in group fairness evaluation
Figure 2: Geo subgroup frequency of human GM annotation (TREC 2022).
Figure 3: Total price of annotation using trained GPT models (GPT-4, GPT-3.5-turbo, and fine-tuned GPT-3.5) by number of documents.
Figure 4: Classification performance by training sample size (TREC 2021).
Figure 5: BERT model (right) outperformed and is less sensitive to imbalanced classes than the bag-of-words model (left) (TREC 2021).
...and 2 more figures
