Improving LLM Group Fairness on Tabular Data via In-Context Learning

Valeriia Cherepanova, Chia-Jung Lee, Nil-Jana Akpinar, Riccardo Fogliato, Martin Andres Bertran, Michael Kearns, James Zou

TL;DR

The paper addresses fairness gaps that arise when deploying LLMs on tabular data in low-data regimes and proposes four in-context learning strategies to improve demographic parity (DP): fair prompt optimization, soft prompt tuning, fair few-shot examples, and self-refinement. These methods are evaluated across four tabular datasets using both open-source and proprietary LLMs. Demographic parity is defined as $\mathbb{E}[f(X)\mid G=g] = \mathbb{E}[f(X)]$ for all groups $g$, and it is assessed via the ratio $\min_g \mathbb{E}[f(X)\mid G=g] / \max_g \mathbb{E}[f(X)\mid G=g]$, which equals 1 under perfect parity. Results show that fair prompt optimization and fair few-shot examples often achieve a high DP ratio while preserving or improving accuracy; soft prompt tuning and self-refinement offer complementary trade-offs depending on model size and compute. The work provides actionable guidance for practitioners on selecting methods based on data availability, model scale, and fairness requirements, and suggests these strategies can extend to other fairness notions beyond demographic parity.
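To make the ratio concrete, here is a minimal sketch of how it could be computed from binary predictions and group labels. The function name `demographic_parity_ratio`, the toy data, and the convention of returning 1.0 when no group receives a positive prediction are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict


def demographic_parity_ratio(predictions, groups):
    """Return (ratio, per-group rates), where ratio is
    min_g E[f(X) | G=g] / max_g E[f(X) | G=g] for binary predictions f(X)."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred

    # Positive-prediction rate per group, i.e. an estimate of E[f(X) | G=g].
    rates = {g: positives[g] / totals[g] for g in totals}

    highest = max(rates.values())
    if highest == 0:
        # Assumed convention: if no group gets positives, treat as parity.
        return 1.0, rates
    return min(rates.values()) / highest, rates


# Hypothetical toy data: group A gets 3/4 positives, group B gets 1/4.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
grps = ["A", "A", "A", "A", "B", "B", "B", "B"]
ratio, rates = demographic_parity_ratio(preds, grps)
print(rates, round(ratio, 3))
```

A ratio near 1 indicates that all groups receive positive predictions at similar rates; values closer to 0 indicate a larger gap between the least and most favored groups.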

Abstract

Large language models (LLMs) have been shown to be effective on tabular prediction tasks in the low-data regime, leveraging their internal knowledge and ability to learn from instructions and examples. However, LLMs can fail to generate predictions that satisfy group fairness, that is, produce equitable outcomes across groups. Critically, conventional debiasing approaches for natural language tasks do not directly translate to mitigating group unfairness in tabular settings. In this work, we systematically investigate four empirical approaches to improve group fairness of LLM predictions on tabular datasets, including fair prompt optimization, soft prompt tuning, strategic selection of few-shot examples, and self-refining predictions via chain-of-thought reasoning. Through experiments on four tabular datasets using both open-source and proprietary LLMs, we show the effectiveness of these methods in enhancing demographic parity while maintaining high overall performance. Our analysis provides actionable insights for practitioners in selecting the most suitable approach based on their specific requirements and constraints.

Paper Structure

This paper contains 38 sections, 3 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Overview of fairness methods explored in this work. We focus on in-context learning approaches, including fair prompt optimization and soft prompt tuning, fair few-shot examples, and self-refinement. For each method, we highlight the specific prompt components optimized in these approaches using different colors, while components of the prompts highlighted in gray do not change across strategies.
  • Figure 2: Meta-prompt used to iteratively refine fairness-instructions using a meta-LLM.
  • Figure 3: Accuracy and demographic parity for manually constructed fair prompts on the Adult dataset, across four models.
  • Figure 4: Accuracy and demographic parity for fair prompts optimized via a meta-LLM: red points denote Pareto-optimal fair prompts, the orange square shows the default model's performance, and black points depict a CatBoost model optimized on 50 examples with grid search.
  • Figure 5: Accuracy and demographic parity ratio for prompts containing few-shot examples with a varying number of positive examples across groups, evaluated on the Adult dataset using Llama8B.
  • ...and 4 more figures