Metric-Fair Prompting: Treating Similar Samples Similarly

Jing Wang, Jie Shen, Xing Niu, Tong Zhang, Jeremy Weiss

TL;DR

The paper presents Metric-Fair Prompting, a fairness-aware prompting framework that enforces a metric-based Lipschitz constraint to ensure similar (question, option) items yield similar scores in MedQA. It introduces a joint-inference protocol over similar questions to promote cross-item consistency and reduce near-boundary errors, guided by clinically decisive features and a margin-based scoring mechanism. Empirical evaluation on MedQA-US shows substantial accuracy gains over standard single-item prompting, suggesting that fairness-guided, confidence-oriented reasoning can improve LLM performance in high-stakes clinical QA. The work integrates embedding-based similarity with constraint-based reasoning to enhance robustness and reliability in medical question answering.

Abstract

We introduce Metric-Fair Prompting, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In the application of multiple-choice medical question answering, each (question, option) pair is treated as a binary instance with label $+1$ (correct) or $-1$ (incorrect). To promote individual fairness (treating similar instances similarly), we compute question similarity using NLP embeddings and solve items in joint pairs of similar questions rather than in isolation. The prompt enforces a global decision protocol: extract decisive clinical features, map each (question, option) pair to a score $f(x)$ that acts as confidence, and impose a Lipschitz-style constraint so that similar inputs receive similar scores and, hence, consistent outputs. Evaluated on the MedQA (US) benchmark, Metric-Fair Prompting is shown to improve performance over standard single-item prompting, demonstrating that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy on high-stakes clinical multiple-choice questions.
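The pairing and constraint steps described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the function names, the greedy pairing strategy, the similarity threshold of 0.9, the use of $1 - \text{cosine similarity}$ as the distance, and the Lipschitz constant of 1.0 are all hypothetical. In the paper the constraint is imposed through the prompt; the check below only shows what the constraint means numerically.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_similar_questions(embeddings, threshold=0.9):
    """Greedily pair questions by descending cosine similarity.

    Hypothetical pairing rule: each question joins at most one pair, and
    only pairs with similarity >= threshold are kept.
    Returns a list of (index_a, index_b, similarity) tuples.
    """
    n = len(embeddings)
    candidates = sorted(
        (
            (cosine_similarity(embeddings[i], embeddings[j]), i, j)
            for i in range(n)
            for j in range(i + 1, n)
        ),
        reverse=True,
    )
    used, pairs = set(), []
    for sim, i, j in candidates:
        if sim < threshold:
            break  # remaining candidates are even less similar
        if i in used or j in used:
            continue
        pairs.append((i, j, sim))
        used.update((i, j))
    return pairs


def lipschitz_violation(score_a, score_b, distance, lipschitz_const=1.0):
    """Amount by which |f(x) - f(x')| exceeds L * d(x, x'); 0.0 if satisfied."""
    return max(0.0, abs(score_a - score_b) - lipschitz_const * distance)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for question embeddings (assumed data).
    toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
    for a, b, sim in pair_similar_questions(toy_embeddings):
        distance = 1.0 - sim  # assumed metric: cosine distance
        print(a, b, round(sim, 4), lipschitz_violation(0.8, 0.7, distance))
```

A nonzero return value from `lipschitz_violation` flags a pair whose scores differ by more than their distance allows, which is the kind of inconsistency the joint-inference protocol is meant to discourage.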

Paper Structure

This paper contains 24 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Geometric view of Metric-Fair Prompting. Questions 1 and 2 are highly similar (small metric distance $d$); their correct options lie on the same side of the decision boundary with nearby margins $d$ and $d+s$ ($s>0$ small). Question 3 is less similar to Question 1 (larger distance $d+s$, $p>0$): its correct option remains in the same half-space but with a more separated margin $d+s+p$. The metric-fair (Lipschitz-like) constraint encourages similar items to receive similar scores and thus consistent decisions.
  • Figure 2: Correlations between the options of Question 1 and Question 2 in Table \ref{tab:medqa-sim}, computed with Qwen3-8B embeddings.
  • Figure 3: Correlations between questions and options from Table \ref{tab:medqa-sim}, computed with Qwen3-8B embeddings.
  • Figure 4: Example of LLM output, given our prompt, for two questions with cosine similarity 0.9612 (Qwen3-4B embeddings). The two patients are identical in clinical features, lab results and biopsy findings. The only difference is age, which is not a distinguishing factor. Hence the correct option is the same: "Adverse effect of anesthetic".
  • Figure 5: Example of LLM output, given our prompt, for two questions with cosine similarity 0.9020 (Qwen3-4B embeddings). Both questions refer to the same study abstract. The first question concerns the interpretation of the standard error (which relates to sample size and variability). The second concerns the statistical method used to determine the significance of group differences.
  • ...and 1 more figure

Theorems & Definitions (1)

  • Definition 1: Lipschitz mapping
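
The listing above gives Definition 1 by name only. For orientation, the standard textbook form of a Lipschitz mapping is sketched below, reusing the score $f$ from the abstract and the metric distance $d$ from Figure 1; the constant $L$ and the exact notation are assumptions, and the paper's own statement may differ.

```latex
% Standard definition (notation assumed, not quoted from the paper):
% a score function f is L-Lipschitz with respect to a metric d if,
% for all inputs x and x',
\[
  \lvert f(x) - f(x') \rvert \;\le\; L \, d(x, x'), \qquad L \ge 0 .
\]
% Small distance d(x, x') therefore forces a small gap between scores.
```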