AXOLOTL: Fairness through Assisted Self-Debiasing of Large Language Model Outputs

Sana Ebrahimi, Kaiwen Chen, Abolfazl Asudeh, Gautam Das, Nick Koudas

TL;DR

AXOLOTL introduces a black-box, post-processing debiasing framework that detects bias orientations via embeddings, proposes pleasant resolutions, and guides LLMs to self-debias their outputs using public APIs. It operates without retraining and is model- and task-agnostic, so it applies across models and tasks. Across StereoSet, WinoBias, and BOLD, the method reduces stereotype and negative regard scores and improves sentiment, with smaller models sometimes outperforming larger ones. Performance depends on embedding quality and the relevance of the supplied bias-term dictionaries, and the need for online API access remains a constraint.

Abstract

Pre-trained Large Language Models (LLMs) have significantly advanced natural language processing capabilities but are susceptible to biases present in their training data, leading to unfair outcomes in various applications. While numerous strategies have been proposed to mitigate bias, they often require extensive computational resources and may compromise model performance. In this work, we introduce AXOLOTL, a novel post-processing framework, which operates agnostically across tasks and models, leveraging public APIs to interact with LLMs without direct access to internal parameters. Through a three-step process resembling zero-shot learning, AXOLOTL identifies biases, proposes resolutions, and guides the model to self-debias its outputs. This approach minimizes computational costs and preserves model performance, making AXOLOTL a promising tool for debiasing LLM outputs with broad applicability and ease of use.
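
The three-step process described above (identify a bias, propose a resolution, ask the model to rewrite its own output) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3-dimensional vectors stand in for real embeddings, and `GROUP_VECTORS`, `UNPLEASANT_VECTORS`, `RESOLUTIONS`, the `0.5` threshold, the prompt wording, and the `llm` callable are all hypothetical names and values chosen for the example.

```python
import math

# Toy 3-d vectors standing in for real embeddings (all values are made up).
GROUP_VECTORS = {"female": (1.0, 0.0, 0.0), "male": (0.0, 1.0, 0.0)}
UNPLEASANT_VECTORS = {"weak": (0.7, 0.0, 0.7), "aggressive": (0.0, 0.7, 0.7)}
# Hypothetical dictionary mapping an unpleasant term to a pleasant resolution.
RESOLUTIONS = {"weak": "strong", "aggressive": "calm"}


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def closest(vec, candidates):
    """Return (name, similarity) of the candidate vector most similar to vec."""
    return max(((name, cosine(vec, c)) for name, c in candidates.items()),
               key=lambda pair: pair[1])


def detect_bias(response_vec, threshold=0.5):
    """Step 1: find the group and unpleasant term the response leans toward.

    Returns None when either similarity falls below the (assumed) threshold.
    """
    group, group_sim = closest(response_vec, GROUP_VECTORS)
    term, term_sim = closest(response_vec, UNPLEASANT_VECTORS)
    if group_sim < threshold or term_sim < threshold:
        return None
    return group, term


def build_rewrite_prompt(response, group, term):
    """Steps 2 and 3: pick a pleasant resolution and phrase a rewrite request."""
    resolution = RESOLUTIONS[term]
    return (
        f"The following text may associate '{group}' with '{term}'. "
        f"Rewrite it so that it is fair, for example in the direction of "
        f"'{resolution}', while keeping its meaning:\n{response}"
    )


def debias(response, response_vec, llm):
    """Run the pipeline; llm is any callable that maps a prompt to text."""
    detected = detect_bias(response_vec)
    if detected is None:
        return response  # nothing flagged, keep the original output
    group, term = detected
    return llm(build_rewrite_prompt(response, group, term))
```

In use, `llm` would wrap a call to a public model API and `response_vec` would come from an embedding model; both are left abstract here so the sketch runs without network access.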

Paper Structure (19 sections, 4 equations, 1 figure, 5 tables)