Reinforcement learning for question answering in programming domain using public community scoring as a human feedback

Alexey Gorbatovski, Sergey Kovalchuk

TL;DR

This work applies reinforcement learning from human feedback (RLHF) to adapt a small GPT Neo 125M model for programming Community Question Answering (CQA), using Stack Overflow scores as the human feedback signal and Proximal Policy Optimization (PPO) for fine-tuning. Two reward-model strategies, regression and contrastive, are trained on data augmented with model-generated answers, and the RLHF pipeline is evaluated against supervised fine-tuning (SFT) and a larger GPT Neo 2.7B baseline using automatic metrics and human MRR. The findings show that RLHF can close much of the gap with larger models on programming tasks. They also reveal substantial divergences between traditional linguistic metrics (e.g., BERTScore, SacreBLEU) and reward-based preferences, which points to the need for domain-specific evaluation. Overall, the study demonstrates that domain-aware RLHF with public community data is a viable way to improve small LLMs, and it identifies larger models and better domain-specific evaluation methodologies as future directions.

Abstract

In this study, we investigate improving the performance of GPT Neo 125M in Community Question Answering (CQA), with a focus on programming, by integrating Reinforcement Learning from Human Feedback (RLHF) and using scores from Stack Overflow. Two distinct reward model training strategies are employed for fine-tuning with Proximal Policy Optimization (PPO). Notably, the performance improvements achieved with this method are comparable to those of the GPT Neo 2.7B-parameter variant. Additionally, an auxiliary scoring mechanism is introduced, which demonstrates the limitations of conventional linguistic metrics in evaluating responses in the programming domain. Through detailed analysis, this paper examines the divergence between traditional linguistic metrics and our human-preference-based reward model, underscoring the need for domain-specific evaluation methods. By clarifying the complexities of applying RLHF to programming CQA and emphasizing the importance of context-aware evaluation, this study contributes to ongoing efforts to refine Large Language Models through focused human feedback.
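The abstract mentions two reward model training strategies without giving their objectives here. As a rough illustration only, the sketch below contrasts one common form of each: a regression objective (mean squared error against a normalized community score) and a contrastive objective (a pairwise loss that pushes the higher-scored answer's reward above the lower-scored one). The function names, the score normalization, and the exact loss forms are assumptions made for this example and are not taken from the paper.

```python
import math


def regression_loss(predicted_rewards, target_scores):
    """Mean squared error between predicted rewards and target scores.

    Assumption: target_scores are community scores already normalized
    to a comparable range; the paper's normalization is not shown here.
    """
    assert predicted_rewards and len(predicted_rewards) == len(target_scores)
    squared = [(p - t) ** 2 for p, t in zip(predicted_rewards, target_scores)]
    return sum(squared) / len(squared)


def contrastive_loss(rewards_better, rewards_worse):
    """Average pairwise loss -log(sigmoid(r_better - r_worse)).

    Assumption: each pair holds two answers to the same question, where
    the first has the higher community score.
    """
    assert rewards_better and len(rewards_better) == len(rewards_worse)
    total = 0.0
    for r_better, r_worse in zip(rewards_better, rewards_worse):
        diff = r_better - r_worse
        # Numerically stable softplus(-diff), equal to -log(sigmoid(diff)).
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(rewards_better)


if __name__ == "__main__":
    # Hypothetical reward-model outputs and normalized scores.
    print(regression_loss([0.2, 0.9], [0.0, 1.0]))
    # The loss is small when the better answer already gets the higher reward.
    print(contrastive_loss([2.0, 1.5], [0.5, -1.0]))
```

The regression form ties the reward to an absolute score, while the pairwise form only uses the relative ordering of answers to the same question, which is one reason the two strategies can behave differently when raw scores are noisy.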

Paper Structure

This paper contains 23 sections, 3 equations, 6 figures, 3 tables, and 2 algorithms.

Figures (6)

  • Figure 1: The general schema of Reinforcement Learning from Human Feedback for programming Q&A
  • Figure 2: The general schema of evaluation approach
  • Figure 3: Metric values as a function of k, the number of generation attempts
  • Figure A1: Spearman correlation coefficients for Base model
  • Figure A2: Spearman correlation coefficients for SFT model
  • ...and 1 more figure