
Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

Jinquan Zheng, Jia Yuan, Jiacheng Yao, Chenyang Gu, Pujun Zheng, Guoxiu He

Abstract

Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on GitHub (https://github.com/ECNU-Text-Computing/PA-GRPO).
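
The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract only states that advantages are taken relative to the mean reward over a permutation group and that consistent decisions are encouraged. The reward shape (correctness plus a majority-agreement bonus), the `bonus` weight, and both function names are assumptions made for illustration, and any variance normalization used in standard GRPO is left out.

```python
from collections import Counter
from statistics import mean


def consistency_aware_rewards(choices, gold, bonus=0.5):
    """Score one permutation group.

    `choices` holds the option *content* the model selected under each
    permutation (labels already mapped back to content), so that
    agreement is judged semantically rather than by label or position.
    ASSUMPTION: reward = correctness + bonus * agreement with the
    majority decision; the paper's exact reward may differ.
    """
    majority, _ = Counter(choices).most_common(1)[0]
    return [float(c == gold) + bonus * float(c == majority) for c in choices]


def cross_permutation_advantages(rewards):
    """Subtract the permutation-group mean reward from each reward."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]
```

For example, with four permutations of one question where the model picks "Paris" three times and "Lyon" once, the three consistent, correct responses receive a positive advantage and the inconsistent one a negative advantage, because all four share a single group baseline.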

Paper Structure

This paper contains 50 sections, 19 equations, 3 figures, 5 tables, and 1 algorithm.

Figures (3)

  • Figure 1: Example of selection bias in discrete-choice settings: swapping the order of answers changes only non-semantic factors.
  • Figure 2: Comparison between Standard GRPO (left) and the proposed PA-GRPO (right). Standard GRPO treats permuted prompts as independent samples, suffering from permutation blindness where inconsistency goes unpunished. In contrast, PA-GRPO organizes samples into Permutation Groups. It introduces (1) a Cross-Permutation Advantage (using the permutation group mean as a baseline) and (2) a Consistency-Aware Reward to explicitly enforce semantic invariance across different permutations of the same instance.
  • Figure 3: Impact of Permutation Group Size ($P$).
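
The captions above refer to permuting answer options and to a permutation group of size $P$. One way such a group might be built is sketched below in Python. The prompt layout, the letter labels, the random sampling of permutations, and the function name `build_permutation_group` are all hypothetical choices for illustration; the source does not specify how permutations are selected.

```python
import itertools
import random


def build_permutation_group(question, options, gold_index, group_size, seed=0):
    """Return up to `group_size` reordered prompts for one instance.

    ASSUMPTION: permutations are sampled uniformly without replacement
    and options are labelled A, B, C, ... (at most 8 options here).
    """
    labels = "ABCDEFGH"
    all_perms = list(itertools.permutations(range(len(options))))
    rng = random.Random(seed)
    chosen = rng.sample(all_perms, min(group_size, len(all_perms)))

    group = []
    for perm in chosen:
        # perm[i] is the index of the original option shown at position i.
        lines = [f"{labels[i]}. {options[j]}" for i, j in enumerate(perm)]
        group.append({
            "prompt": question + "\n" + "\n".join(lines),
            # Label under which the correct option appears in this prompt.
            "gold_label": labels[perm.index(gold_index)],
            "perm": perm,
        })
    return group
```

Every prompt in the group has the same semantic content; only the positions and labels of the options change, which are the non-semantic factors that Figure 1 illustrates.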