Toward Automatic Group Membership Annotation for Group Fairness Evaluation

Fumian Chen, Dayu Yang, Hui Fang

TL;DR

To address the sparsity of group membership (GM) annotations for group fairness evaluation, the paper assesses automatic GM annotation using four language-model classifiers and compares their performance on fairness tasks. It finds that BERT-based sentence classification achieves the best accuracy with minimal supervision, while generative LLMs are less competitive for discriminative GM labeling and incur high costs. Replacing human GM annotations with BERT-based GM annotations yields strong agreement with human-based fairness evaluations at both the system and query levels across the TREC Fair Ranking tasks and NTCIR FairWeb-1. This work enables scalable, low-cost fairness evaluation and broadens the set of IR datasets available for fairness research, with code released for reproducibility.

Abstract

With increasing research attention on fairness in information retrieval systems, more and more fairness-aware algorithms have been proposed to ensure fairness for a sustainable and healthy retrieval ecosystem. However, group fairness evaluation metrics, the most widely adopted measures for fairness-aware algorithms, require group membership information that demands extensive human annotation and is rarely available for general information retrieval datasets. This data sparsity significantly impedes the development of fairness-aware information retrieval studies. Hence, a practical, scalable, low-cost group membership annotation method is needed to assist or replace human annotation. This study explores how to leverage language models to automatically annotate group membership for group fairness evaluation, focusing on annotation accuracy and its impact. Our experimental results show that BERT-based models outperform state-of-the-art large language models, including GPT and Mistral, achieving promising annotation accuracy with minimal supervision on recent fair-ranking datasets. Our impact-oriented evaluations reveal that minimal annotation error does not degrade the effectiveness and robustness of group fairness evaluation. The proposed annotation method greatly reduces human effort and extends fairness-aware studies to more datasets.
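The impact-oriented evaluation described above can be pictured with a small sketch: score each system once with human GM labels and once with automatic GM labels, then check whether the two sets of scores order the systems the same way. Everything below is an illustrative assumption rather than the paper's method: the exposure-based `fairness_score`, the log-discount weighting, the use of Kendall's tau as the agreement measure, and all function and variable names are hypothetical.

```python
import math
from itertools import combinations


def exposure_by_group(ranking, labels):
    """Sum a log-discounted exposure weight per group over a ranked list.

    `labels` maps document id -> group label (human or automatic).
    The 1/log2(rank + 1) discount is an assumed, illustrative choice.
    """
    exposure = {}
    for rank, doc in enumerate(ranking, start=1):
        group = labels[doc]
        exposure[group] = exposure.get(group, 0.0) + 1.0 / math.log2(rank + 1)
    return exposure


def fairness_score(ranking, labels, target):
    """Return 1 minus the total variation distance between the groups'
    exposure shares and the target shares (1.0 means a perfect match)."""
    exposure = exposure_by_group(ranking, labels)
    total = sum(exposure.values())
    groups = set(target) | set(exposure)
    tvd = 0.5 * sum(
        abs(exposure.get(g, 0.0) / total - target.get(g, 0.0)) for g in groups
    )
    return 1.0 - tvd


def kendall_tau(xs, ys):
    """Kendall tau-a between two equal-length score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) / 2
    return (concordant - discordant) / pairs
```

To use the sketch, compute `fairness_score` for every system under the human labels and again under the automatic labels, then pass the two score lists to `kendall_tau`. A value near 1 would indicate that the automatic labels preserve the system ordering obtained with human labels.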

Paper Structure

This paper contains 15 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: The necessity of GM annotation in group fairness evaluation.
  • Figure 2: Geo subgroup frequency of human GM annotation (TREC 2022).
  • Figure 3: Total price of annotation using trained GPT models (GPT-4, GPT-3.5-turbo, and fine-tuned GPT-3.5) by number of documents.
  • Figure 4: Classification performance by training sample size (TREC 2021).
  • Figure 5: The BERT model (right) outperforms the bag-of-words model (left) and is less sensitive to imbalanced classes (TREC 2021).
  • ...and 2 more figures