Improving Neutral Point-of-View Generation with Data- and Parameter-Efficient RL
Jessica Hoffmann, Christiane Ahlheim, Zac Yu, Aria Walfrand, Jarvis Jin, Marie Tano, Ahmad Beirami, Erin van Liemt, Nithum Thain, Hakim Sidahmed, Lucas Dixon
TL;DR
The paper addresses the challenge of generating Neutral Point of View (NPOV) responses to sensitive topics by proposing Parameter-Efficient Reinforcement Learning (PE-RL) with LoRA adapters. It introduces SHQ-NPOV, a small but high-quality dataset, and demonstrates that PE-RL substantially improves NPOV quality and linguistic depth compared with baselines like LoRA SFT, SFT, RLHF, and Best-of-10, with strong generalization to out-of-distribution topics. A key finding is that data-efficient PE-RL can avoid overfitting in a low-data regime and still generalize, especially when initialized from a LoRA SFT checkpoint. The authors release SHQ-NPOV and a reproducible methodology for iterative dataset creation, positioning PE-RL as a practical path toward safer, more informative multi-perspective AI responses, while acknowledging open questions about neutrality definitions, data sourcing, and environmental costs.
Abstract
The paper shows that parameter-efficient reinforcement learning (PE-RL) is a highly effective training regime to improve large language models' (LLMs) ability to answer queries on sensitive topics with a Neutral Point of View (NPOV), i.e., to provide significantly more informative, diverse and impartial answers. This is shown by evaluating PE-RL against multiple strong baselines, including LoRA finetuning (the strongest baseline), SFT and RLHF. PE-RL not only improves overall NPOV quality compared to the strongest baseline ($97.06\%\rightarrow 99.08\%$), but also scores much higher on features linguists identify as key to separating sufficient answers from "great" answers ($60.25\%\rightarrow 85.21\%$ for presence of supportive details, $68.74\%\rightarrow 91.43\%$ for absence of oversimplification). A qualitative analysis corroborates this. Moreover, our evaluation also finds a key property of PE-RL for this task: unlike methods that update all parameters, it generalises out of topic. Finally, to enable further studies we also release the dataset, SHQ-NPOV, and provide a methodology to create such datasets through iterative rounds of human peer-critique and annotator training.
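To put the reported scores on a common footing, the following minimal sketch recomputes the gains quoted above in two ways: as absolute percentage points, and as the share of the remaining headroom to 100% that PE-RL closes relative to the strongest baseline. The input numbers are taken from the abstract; the headroom measure and the names `reported`, `gains` and `summary` are illustrative assumptions, not quantities or code from the paper.

```python
# Scores (in %) as reported in the abstract:
# (strongest baseline, i.e. LoRA finetuning; PE-RL).
reported = {
    "overall NPOV quality": (97.06, 99.08),
    "presence of supportive details": (60.25, 85.21),
    "absence of oversimplification": (68.74, 91.43),
}


def gains(baseline, pe_rl):
    """Return (absolute gain in points, % of headroom to 100 closed).

    The headroom measure is an illustrative assumption, not a metric
    defined in the paper.
    """
    absolute = pe_rl - baseline
    headroom = 100.0 - baseline
    return round(absolute, 2), round(100.0 * absolute / headroom, 1)


summary = {name: gains(*scores) for name, scores in reported.items()}

for name, (points, closed) in summary.items():
    print(f"{name}: +{points} points, {closed}% of headroom closed")
```

Read this way, the small absolute gain in overall NPOV quality is measured against a baseline that is already close to the ceiling, while the two linguistic features leave far more room for improvement.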
