Intent-Aware Self-Correction for Mitigating Social Biases in Large Language Models

Panatchakorn Anantaprayoon, Masahiro Kaneko, Naoaki Okazaki

TL;DR

The paper introduces an intent-aware Self-Correction framework for debiasing LLMs by explicitly signaling debiasing goals at three stages: instruction, response, and feedback. It combines an explicit debiasing prompt with Chain-of-Thought reasoning and multi-aspect feedback to refine outputs, and it uses an early-stopping mechanism driven by feedback scores. Across nine BBQ bias categories and multiple models, the approach yields more robust and consistent debiasing than baselines, with cross-model feedback often outperforming same-model feedback, though effectiveness depends on the bias level of the response generator. The findings highlight feedback quality as a key driver of refinement success and suggest practical guidance for deploying debiasing Self-Correction, while noting limitations in prompt design and evaluation scope. Overall, the work advances a principled, intent-aware approach to reducing social biases in LLMs with actionable insights for future research and application.

Abstract

Self-Correction based on feedback improves the output quality of Large Language Models (LLMs). Moreover, because Self-Correction resembles the slow, conscious System-2 thinking described in cognitive psychology, it can potentially reduce LLMs' social biases. LLMs are sensitive to contextual ambiguities and inconsistencies; therefore, it is crucial to communicate intentions explicitly during interactions when applying Self-Correction for debiasing. In this study, we demonstrate that clarifying intentions is essential for effectively reducing biases in LLMs through Self-Correction. We divide the components needed for Self-Correction into three parts: instruction, response, and feedback, and clarify intentions at each component. In the instruction, we incorporate an explicit debiasing prompt to convey the intention of bias mitigation for response generation. In the response, we use Chain-of-Thought (CoT) to clarify the reasoning process. In the feedback, we define evaluation aspects necessary for debiasing and propose clear feedback through multi-aspect critiques and scoring. Through experiments, we demonstrate that self-correcting CoT responses obtained from a debiasing prompt based on multi-aspect feedback can reduce biased responses more robustly and consistently than the baselines. We also find that debiasing efficacy varies when using models with different bias levels or when separating the models used for response and feedback generation.
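
To make the three-part structure concrete, the following is a minimal sketch of a Self-Correction loop with score-based early stopping, as mentioned in the TL;DR. It is not the authors' implementation. The prompt wording, the `Feedback` container, the `self_correct` function, the `max_score` threshold, and the round limit are all hypothetical, and the response generator and feedback generator are passed in as plain callables so that the same or different models can fill each role.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical wording; the paper's actual debiasing prompt is not reproduced here.
DEBIAS_PROMPT = "Answer without relying on social stereotypes."
COT_TRIGGER = "Let's think step by step."


@dataclass
class Feedback:
    """Multi-aspect feedback: one critique and one score per evaluation aspect."""
    critiques: Dict[str, str]  # aspect name -> critique text (aspect names are assumptions)
    scores: Dict[str, int]     # aspect name -> numeric score

    def total(self) -> int:
        return sum(self.scores.values())


Generator = Callable[[str], str]        # prompt -> response text
Critic = Callable[[str, str], Feedback] # (question, response) -> feedback


def self_correct(
    question: str,
    generate: Generator,
    critique: Critic,
    max_score: int,
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Refine a CoT response until the feedback score reaches max_score."""
    # Instruction stage: state the debiasing intent and ask for explicit reasoning.
    instruction = f"{DEBIAS_PROMPT}\n{question}\n{COT_TRIGGER}"
    response = generate(instruction)
    history = [response]

    for _ in range(max_rounds):
        # Feedback stage: critique and score the current response.
        feedback = critique(question, response)
        if feedback.total() >= max_score:
            break  # early stopping: no further refinement needed
        critique_text = "\n".join(
            f"- {aspect}: {text}" for aspect, text in feedback.critiques.items()
        )
        # Refinement: regenerate with the previous response and its critiques.
        response = generate(
            f"{instruction}\n\nPrevious response:\n{response}\n\n"
            f"Feedback:\n{critique_text}\n\nRevise the response."
        )
        history.append(response)

    return response, history
```

Passing a different model as `critique` than as `generate` corresponds to the cross-model feedback setting; passing the same model to both corresponds to same-model feedback.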

Paper Structure

This paper contains 33 sections, 6 equations, 1 figure, 11 tables.

Figures (1)

  • Figure 1: Explicit instruction, response, and feedback are crucial for effective Self-Correction. Here, a debiasing prompt is used to clarify the instruction, CoT is used to clarify the response's reasoning, and multi-aspect critiques and scoring are used to clarify the feedback.