Reinforcement learning for question answering in programming domain using public community scoring as a human feedback
Alexey Gorbatovski, Sergey Kovalchuk
TL;DR
This work investigates reinforcement learning from human feedback (RLHF) as a way to adapt a small GPT-Neo 125M model to programming Community Question Answering (CQA), using Stack Overflow scores as a proxy for human feedback and fine-tuning with Proximal Policy Optimization (PPO). Two reward-model strategies, regression and contrastive, are trained on data augmented with model-generated answers, and the RLHF pipeline is evaluated against supervised fine-tuning and a larger 2.7B baseline using automatic metrics and a human-evaluation MRR (mean reciprocal rank). The findings show that RLHF can close much of the gap with larger models on programming tasks, while revealing substantial divergences between traditional linguistic metrics (e.g., BERTScore, SacreBLEU) and reward-based preferences, which points to the need for domain-specific evaluation. Overall, the study demonstrates the viability of domain-aware RLHF with public community data for improving small LLMs and highlights two future directions: larger models and better domain-specific evaluation methodologies.
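For readers unfamiliar with MRR, the sketch below shows the standard mean reciprocal rank computation: for each question, take the reciprocal of the rank at which the first acceptable answer appears, then average over questions. How the human rankings were collected is not described in this section, so the function name `mean_reciprocal_rank`, the input format, and the example ranks are illustrative assumptions rather than the authors' evaluation code.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over questions.

    Each entry is the 1-based rank of the first acceptable answer for one
    question, or None if no acceptable answer was ranked (contributes 0).
    """
    if not first_relevant_ranks:
        raise ValueError("need at least one question")
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is None:
            continue  # no acceptable answer: reciprocal rank counts as 0
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(first_relevant_ranks)


# Hypothetical judgments for three questions: first acceptable answer at
# rank 1, at rank 2, and not found at all.
example_mrr = mean_reciprocal_rank([1, 2, None])
```

With these made-up ranks the value is (1 + 1/2 + 0) / 3 = 0.5; a system that always places an acceptable answer first would score 1.0.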
Abstract
In this study, we investigate improving the performance of GPT-Neo 125M on Community Question Answering (CQA), with a focus on programming, by integrating Reinforcement Learning from Human Feedback (RLHF) and using scores from Stack Overflow. Two distinct reward model training strategies are employed for fine-tuning with Proximal Policy Optimization (PPO). Notably, the resulting performance is comparable to that of the GPT-Neo 2.7B-parameter variant. Additionally, an auxiliary scoring mechanism is introduced, which demonstrates the limitations of conventional linguistic metrics in evaluating responses in the programming domain. Through careful analysis, this paper examines the divergence between traditional linguistic metrics and our human-preferences-based reward model, underscoring the need for domain-specific evaluation methods. By elucidating the complexities of applying RLHF to programming CQA and emphasizing the significance of context-aware evaluation, this study contributes to ongoing efforts to refine Large Language Models through focused human feedback.
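The TL;DR names the two reward model training strategies as regression and contrastive. As a rough illustration of how such objectives commonly differ, the sketch below contrasts a mean-squared-error loss on a scalar score with a pairwise logistic ranking loss on two answers to the same question. These are generic textbook forms, assumed here for illustration; the exact losses, score normalization, and pairing scheme used in the paper are not given in this section, and the function names are hypothetical.

```python
import math
from typing import Sequence


def regression_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between predicted rewards and target scores
    (for example, normalized community scores; normalization is assumed)."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def pairwise_contrastive_loss(reward_preferred: float, reward_other: float) -> float:
    """Pairwise logistic loss: -log(sigmoid(r_preferred - r_other)).

    Small when the preferred answer gets the higher reward, large otherwise.
    Computed as softplus(-diff) in a numerically stable way.
    """
    diff = reward_preferred - reward_other
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Made-up reward values: the first pair is ordered correctly, the second is not.
loss_correct_order = pairwise_contrastive_loss(2.0, 0.0)
loss_wrong_order = pairwise_contrastive_loss(0.0, 2.0)
```

The regression form needs an absolute target for every answer, whereas the pairwise form only needs to know which of two answers the community scored higher, which makes it less sensitive to the scale of the raw scores.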
