
Strategic Demonstration Selection for Improved Fairness in LLM In-Context Learning

Jingyu Hu, Weiru Liu, Mengnan Du

TL;DR

The paper addresses fairness in large language model in-context learning (ICL) for tabular data, showing that demonstration choice can meaningfully shift subgroup fairness. It introduces the Fairness via Clustering-Genetic (FCG) algorithm to curate diverse, representative demonstrations by clustering data into four subgroup conditions and applying a genetic-like selection to optimize a joint predictive-fair score. Across Credit and Adult datasets and five LLMs, minority-focused demonstrations improve fairness metrics such as $R_{dp}$ and $R_{eo}$ with minimal impact on accuracy, and FCG further enhances fairness compared with random or LENs-based methods. The work provides a practical, compute-efficient approach to improve real-world fairness in ICL prompts, with implications for deploying LLMs in high-stakes tabular tasks and for future exploration of multi-attribute fairness and fine-tuning effects.
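To make the two fairness metrics named above concrete, here is a minimal sketch. It assumes $R_{dp}$ and $R_{eo}$ take the common ratio forms (demographic parity ratio and equalized odds ratio, where 1.0 means parity); this summary does not give the paper's exact definitions, so the function names and formulas below are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict


def _min_max_ratio(rates):
    # Ratio of the smallest to the largest group rate; 1.0 when undefined.
    if not rates or max(rates) == 0:
        return 1.0
    return min(rates) / max(rates)


def _mean(values):
    return sum(values) / len(values)


def demographic_parity_ratio(y_pred, z):
    """Assumed form of R_dp: min / max positive-prediction rate across groups."""
    by_group = defaultdict(list)
    for pred, group in zip(y_pred, z):
        by_group[group].append(pred)
    return _min_max_ratio([_mean(v) for v in by_group.values()])


def equalized_odds_ratio(y_true, y_pred, z):
    """Assumed form of R_eo: the smaller of the TPR ratio and the FPR ratio."""
    tpr, fpr = defaultdict(list), defaultdict(list)
    for true, pred, group in zip(y_true, y_pred, z):
        # Predictions on true positives feed TPR; on true negatives, FPR.
        (tpr if true == 1 else fpr)[group].append(pred)
    tpr_ratio = _min_max_ratio([_mean(v) for v in tpr.values()])
    fpr_ratio = _min_max_ratio([_mean(v) for v in fpr.values()])
    return min(tpr_ratio, fpr_ratio)
```

Both functions take plain lists of binary labels and predictions plus a list of group identifiers, so they can be applied directly to the parsed outputs of an ICL prompt.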

Abstract

Recent studies highlight the effectiveness of using in-context learning (ICL) to steer large language models (LLMs) in processing tabular data, a challenging task given the structured nature of such data. Despite advancements in performance, the fairness implications of these methods are less understood. This study investigates how varying demonstrations within ICL prompts influence the fairness outcomes of LLMs. Our findings reveal that deliberately including minority group samples in prompts significantly boosts fairness without sacrificing predictive accuracy. Further experiments demonstrate that the proportion of minority to majority samples in demonstrations affects the trade-off between fairness and prediction accuracy. Based on these insights, we introduce a mitigation technique that employs clustering and evolutionary strategies to curate a diverse and representative sample set from the training data. This approach aims to enhance both predictive performance and fairness in ICL applications. Experimental results validate that our proposed method dramatically improves fairness across various metrics, showing its efficacy in real-world scenarios.

Paper Structure (18 sections, 5 equations, 5 figures, 17 tables, 1 algorithm)


Figures (5)

  • Figure 1: The Prompt Template and Content Example (*Perturbation is optional and is used to test the effectiveness of ICL. Perturbations are discussed in the paper's subsection on perturbations.)
  • Figure 2: Prediction and fairness performance comparison across different LLMs on the Credit dataset. Fairness metrics improve when samples from minority groups are prioritized.
  • Figure 3: The Workflow of Perturbations
  • Figure 4: The Workflow of Fairness via Clustering-Genetic (FCG) on the Adult Dataset ($r_y=1$, all high-income; $r_y=0$, all low-income; $r_y=0.5$, balanced samples of high-income and low-income; $r_z=1$, all females; $r_z=0$, all males; $r_z=0.5$, balanced samples of females and males.)
  • Figure 5: The workflow comparison of demonstration selection algorithms: FCG (Ours, proposed in the paper's section on fairness mitigation), LENs, and LENs with FCG.
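
The ratios $r_z$ and $r_y$ in the Figure 4 caption can be read as quotas over four subgroup cells (group × label). The sketch below shows one way to draw demonstrations under such quotas. It is not the FCG algorithm itself (no clustering or genetic selection), and the row format, the encoding `z=1` for the minority group (females on Adult) and `y=1` for the positive label (high income), and the function name are all assumptions made for illustration.

```python
import random


def sample_demonstrations(rows, k, r_z, r_y, seed=0):
    """Draw k demonstrations with a fraction r_z from the minority group
    (z == 1) and, within each group, a fraction r_y with label y == 1.

    `rows` is a hypothetical list of dicts with integer keys "z" and "y".
    """
    rng = random.Random(seed)
    n_minority = round(k * r_z)  # note: Python rounds halves to even

    # Turn the two ratios into a quota for each of the four (z, y) cells.
    quotas = {}
    for z_value, n_group in ((1, n_minority), (0, k - n_minority)):
        n_positive = round(n_group * r_y)
        quotas[(z_value, 1)] = n_positive
        quotas[(z_value, 0)] = n_group - n_positive

    demos = []
    for (z_value, y_value), n in quotas.items():
        pool = [r for r in rows if r["z"] == z_value and r["y"] == y_value]
        # rng.sample raises ValueError if a cell has fewer than n rows.
        demos.extend(rng.sample(pool, n))

    rng.shuffle(demos)  # avoid ordering demonstrations by subgroup
    return demos
```

For example, `sample_demonstrations(rows, 8, r_z=1.0, r_y=0.5)` would return eight minority-group rows, half of them with the positive label, matching the "all females, balanced income" setting in the caption.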