Uncovering Biases with Reflective Large Language Models

Edward Y. Chang

TL;DR

This work presents the Reflective LLM Dialogue Framework (RLDF), which uses structured adversarial dialogues between multiple instances of a single LLM, or between different LLMs, to uncover diverse perspectives and correct inconsistencies. It enables systematic bias detection through conditional statistics, information theory, and divergence metrics.

Abstract

Biases and errors in human-labeled data present significant challenges for machine learning, especially in supervised learning that relies on potentially flawed ground-truth data. These flaws, including diagnostic errors and societal biases, risk being propagated and amplified through models trained using maximum likelihood estimation. We present the Reflective LLM Dialogue Framework (RLDF), which leverages structured adversarial dialogues between multiple instances of a single LLM, or between different LLMs, to uncover diverse perspectives and correct inconsistencies. By conditioning LLMs to adopt opposing stances, RLDF enables systematic bias detection through conditional statistics, information theory, and divergence metrics. Experiments show that RLDF successfully identifies potential biases in public content while exposing limitations in human-labeled data. Our framework supports measurable progress tracking and explainable remediation actions, offering a scalable approach for improving content neutrality through transparent, multi-perspective analysis.

Paper Structure (38 sections, 15 equations, 4 figures, 9 tables)


Figures (4)

  • Figure 1: Specifications of Algorithm $\mathsf{EVINCE}$. Key points: 1) Asymmetric Start: In Step #1, LLM$_A$ initiates the debate with opening arguments based solely on the given information, while LLM$_B$ begins with access to LLM$_A$'s prediction and arguments, enabling it to refute. The contentiousness level is initially set to high. 2) Termination Criteria: The while loop in Step #2 evaluates multiple factors, including Wasserstein distance, divergence metrics, and argument quality. The dialogue terminates if significant progress is no longer observed. 3) Contentiousness Modulation: In Step #2.2, contentiousness is updated based on divergence metrics and Wasserstein distance, as detailed in the modulation formula provided in Appendix F. 4) Joint Distribution Generation: Step #3 produces a joint distribution weighted by the quality of reasoning.
  • Figure 2: Distances Between D, R, and S.
  • Figure 3: Bias Rating Distributions Show Strong Biases. D is more negative on how D scandals were reported (the sub-figure on the left), R is more negative on how R scandals were reported (the sub-figure on the right).
  • Figure 4: Convergence of all metrics: Wasserstein distance, normalized mutual information, and normalized cross entropy.
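
The control flow summarized in the Figure 1 caption (a high initial contentiousness, a loop that stops once progress stalls, contentiousness modulation, and a quality-weighted joint distribution) can be illustrated with a minimal Python sketch. This is not the paper's EVINCE implementation. The two LLMs are replaced by a hypothetical `mix` function that shifts probability mass toward the opponent, the contentiousness update is a placeholder rather than the Appendix F formula, and the names `debate`, `mix`, `quality_a`, `quality_b`, and `min_progress` are assumptions introduced for illustration. Distributions are taken to be over ordered bias-rating bins.

```python
import math


def wasserstein_1d(p, q):
    """1-D Wasserstein distance between two distributions over the same
    ordered, unit-spaced bins: the sum of absolute CDF differences."""
    cdf_gap, total = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (one common divergence metric)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mix(own, other, weight):
    """Hypothetical stand-in for an LLM revising its prediction:
    move a fraction `weight` of the way toward the opponent's distribution."""
    return [(1 - weight) * a + weight * b for a, b in zip(own, other)]


def debate(p_a, p_b, quality_a=0.5, quality_b=0.5,
           min_progress=1e-3, max_rounds=20):
    # Step 1: the debate starts with contentiousness set high.
    contentiousness = 0.9
    prev = wasserstein_1d(p_a, p_b)
    history = [prev]

    # Step 2: iterate until significant progress is no longer observed.
    for _ in range(max_rounds):
        concession = 1.0 - contentiousness
        # Both sides revise simultaneously from the previous round's positions.
        p_a, p_b = mix(p_a, p_b, concession / 2), mix(p_b, p_a, concession / 2)
        dist = wasserstein_1d(p_a, p_b)
        history.append(dist)
        if prev - dist < min_progress:
            break
        prev = dist
        # Step 2.2 (placeholder rule, NOT the Appendix F formula):
        # lower contentiousness as the two positions converge.
        contentiousness = max(0.1, contentiousness * 0.7)

    # Step 3: joint distribution weighted by (assumed) reasoning quality.
    total_q = quality_a + quality_b
    joint = [(quality_a * a + quality_b * b) / total_q
             for a, b in zip(p_a, p_b)]
    return joint, history


if __name__ == "__main__":
    # Two made-up rating distributions over five ordered bins.
    llm_a = [0.6, 0.2, 0.1, 0.1, 0.0]
    llm_b = [0.0, 0.1, 0.1, 0.3, 0.5]
    print("initial JS divergence (bits):", js_divergence(llm_a, llm_b))
    joint, history = debate(llm_a, llm_b)
    print("Wasserstein distance per round:", history)
    print("joint distribution:", joint)
```

In a real system, `mix` would be replaced by prompting each model with the opponent's arguments at the current contentiousness level, and the quality weights would come from an evaluation of each side's reasoning rather than fixed constants.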

Theorems & Definitions (1)

  • proof