Overton Pluralistic Reinforcement Learning for Large Language Models

Yu Fu; Seongho Son; Ilija Bogunovic

TL;DR

This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration.

Abstract

Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
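
The dual-reward idea in the abstract can be illustrated with a minimal Python sketch. It assumes the coverage and uniqueness rewards are combined as a weighted sum and then standardized within a group of sampled responses, as is standard in GRPO. The function names, the weighted-sum form, the weight values, and the toy scores are all hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev


def combined_reward(r_cov, r_uniq, alpha_cov=1.0, alpha_uniq=1.0):
    """Weighted sum of a coverage reward and a uniqueness reward.

    Assumption: the paper's exact combination rule is not given in this
    summary; a weighted sum is used here purely for illustration.
    """
    return alpha_cov * r_cov + alpha_uniq * r_uniq


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same query.

    Each advantage is (reward - group mean) / (group std + eps). The
    population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of three sampled responses: (coverage, uniqueness) scores.
group = [(0.8, 0.9), (0.4, 1.0), (0.6, 0.5)]
rewards = [combined_reward(c, u, alpha_cov=2.0, alpha_uniq=1.0) for c, u in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, the first response has the highest combined reward and therefore the only positive advantage, so a policy update would push probability mass toward it.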

Paper Structure (48 sections, 10 equations, 12 figures, 11 tables, 1 algorithm)

This paper contains 48 sections, 10 equations, 12 figures, 11 tables, 1 algorithm.

Introduction
Related Work
Preliminaries.
Reinforcement Learning from Human Feedback (RLHF).
Sentence Transformer (SBERT).
Overton Pluralistic Window.
Methodology
Dataset Preparation
OP Dataset Refinement.
OP-Triplet Dataset.
Constructing coverage estimator with Sentence Transformers
Construction of Reward Functions
Overton Pluralistic RLHF
GRPO objective.
Experiments
...and 33 more sections

Figures (12)

Figure 1: An overview of different examples aimed at generating Overton Pluralism windows. In the top-left corner are real multi-perspective responses from diverse human groups to a given query. The figure then illustrates: (1) responses generated by a base LLM using an implicit OP prompt; (2) the same base LLM prompted with an explicit OP instruction; (3) a modular pluralism approach that combines outputs from different community-specific LLMs, summarized by a final LLM to form the OP window (Feng et al., 2024); and (4) our RL-trained pipeline, which achieves the highest correct coverage of human reference perspectives in OP responses.
Figure 2: Example illustrating the perspective matching strategy when the number of candidate perspectives equals the number of human reference perspectives, using the OP-SBERT model as the similarity estimator. Top: In the initial approach, each candidate sentence is matched to the reference sentence with the highest similarity score, which can result in multiple candidates being linked to the same reference. Bottom: Left: The initial matching method, which often leads to a many-to-one matching problem. Right: The improved matching strategy using the MBGM algorithm, where once a perspective pair is selected, it is fixed and the corresponding reference is excluded from subsequent matching steps, thereby preventing repeated pairings.
Figure 3: Comparison of initial and optimized matching strategies using different SBERT models under OP-reward evaluation. Each method shown here adopts its optimal threshold value.
Figure 4: Average output length of each model (sorted from high to low). Our trained models generate relatively shorter responses without explicitly constraining the output length. The output length of OP-GRPO models represents the final tokens extracted from the summary block.
Figure 4: Ablation on reward weight ratios ($\alpha_{\mathrm{cov}}:\alpha_{\mathrm{uniq}}$) for OP-GRPO. Results are reported on the $5p$ and $10p$ subtasks of the OP-V2 test set using Qwen2.5-3B-Instruct. Higher is better.
...and 7 more figures
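
The many-to-one problem described in the Figure 2 caption, and the one-to-one fix, can be sketched as follows. This is a hedged illustration of greedy one-to-one matching on a similarity matrix, not the paper's MBGM algorithm itself. The function names, the global highest-score-first ordering, the threshold value, and the toy similarity scores are all assumptions.

```python
def greedy_one_to_one_match(sim, threshold=0.5):
    """Greedily pair candidates with references, using each at most once.

    sim[i][j] is the similarity between candidate i and reference j.
    Pairs are visited from highest to lowest score; once a pair is fixed,
    its candidate and reference are excluded from later steps.
    Returns a list of (candidate_index, reference_index, score).
    """
    scored = sorted(
        ((sim[i][j], i, j) for i in range(len(sim)) for j in range(len(sim[i]))),
        reverse=True,
    )
    used_candidates, used_references, matches = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # all remaining pairs fall below the threshold
        if i in used_candidates or j in used_references:
            continue  # this candidate or reference is already paired
        used_candidates.add(i)
        used_references.add(j)
        matches.append((i, j, score))
    return matches


def coverage(sim, threshold=0.5):
    """Fraction of reference perspectives matched by some candidate."""
    n_refs = len(sim[0]) if sim else 0
    if n_refs == 0:
        return 0.0
    return len(greedy_one_to_one_match(sim, threshold)) / n_refs


# Toy case: every candidate is most similar to reference 0.
sim = [
    [0.9, 0.2, 0.1],
    [0.8, 0.6, 0.3],
    [0.7, 0.1, 0.2],
]
naive = [max(range(3), key=lambda j: row[j]) for row in sim]  # argmax per candidate
matches = greedy_one_to_one_match(sim)
```

Here per-candidate argmax links all three candidates to reference 0, whereas the greedy one-to-one pass pairs candidate 0 with reference 0 and candidate 1 with reference 1, leaving candidate 2 unmatched because its remaining scores fall below the threshold.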
