Aligning Crowd-sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models

Man Fai Wong, Chee Wei Tan

TL;DR

This work addresses aligning crowd-sourced human feedback with reinforcement learning for text-to-code generation by large language models. It introduces cRLHF, a Bayesian-inference-based framework that aggregates multi-annotator feedback to compute an aligned reward score $s$ without training an extra reward model, and uses PPO-based optimization to fine-tune code-generating LLMs. Key contributions include a formal problem formulation, a probabilistic method for annotator reliability, and an optimization perspective leveraging proximal gradient methods with $\ell_1$ regularization, demonstrated on HumanEval and MBPP benchmarks with diverse baselines. The results show modest yet consistent improvements, especially for larger models, and the framework offers extensibility to domain-specific languages, potentially broadening the impact of AI-assisted programming.

Abstract

This paper studies how AI-assisted programming and large language models (LLMs) improve software developers' ability through AI tools (LLM agents) such as GitHub Copilot and Amazon CodeWhisperer, and how reinforcement learning from human feedback (RLHF) can be combined with crowd-sourced computation to enhance text-to-code generation. Additionally, we demonstrate that our Bayesian optimization framework supports AI alignment in code generation by distributing the feedback collection burden, highlighting the value of collecting good-quality human feedback. Our empirical evaluations demonstrate the efficacy of this approach, showing how LLM agents can be effectively trained for improved text-to-code generation. Our Bayesian optimization framework can be designed for general domain-specific languages, promoting the alignment of large language model capabilities with human feedback in AI-assisted programming for code generation.

Paper Structure

This paper contains 18 sections, 17 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Schematic diagram of traditional RLHF and of the strategy for aligning crowd-sourced human feedback on code generation. The upper part of the figure shows the traditional RLHF method, in which an annotator ranks the generated outputs and forwards them to the reward model. The lower part shows the crowd-sourced RLHF strategy, in which feedback from multiple annotators is automatically aggregated into a consensus output, expressed as a ranking or as reward scores.
  • Figure 2: An overview of the cRLHF framework with the reinforcement learning algorithm utilizing proximal policy optimization (PPO) for code generation.
  • Figure 3: The overview of the process for evaluating $p_i$ across all annotators. Presented with descriptions and code examples, as depicted on the left-hand side of the figure, each annotator, with their respective $p_i$ values, is prompted to identify errors within the code. The system then assesses the accuracy of the annotations and proceeds to update the $p_i$ values accordingly.
  • Figure 4: The user interface on an online crowdsourcing platform for code annotation tasks. Annotators receive various code snippets generated by LLMs alongside their corresponding descriptions. Their task is to annotate the lines of the program that contain errors. A detailed guide on how to use the code annotation tool is also provided to annotators.
  • Figure 5: Comparative performance analysis of the baseline and fine-tuned models on the HumanEval task (left) and the MBPP task (right). Each point represents the performance of a specific model, with lines connecting the baseline and fine-tuned results. Green lines indicate an improvement in performance after fine-tuning, while red lines indicate a decrease.
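
The TL;DR and the Figure 3 caption describe aggregating multi-annotator feedback using a per-annotator reliability $p_i$ that is updated as annotations are checked. The page does not give the paper's actual formulas, so the following is only a minimal sketch under stated assumptions: each annotator casts an independent binary vote on whether a line contains an error, $p_i$ is the probability that annotator $i$ votes correctly, votes are combined with Bayes' rule, and $p_i$ is refreshed as a Beta-posterior mean. The function names `posterior_error_probability` and `update_reliability`, the uniform prior, and the Beta(1, 1) pseudo-counts are all hypothetical choices for illustration, not the authors' method.

```python
import math


def posterior_error_probability(votes, reliabilities, prior=0.5):
    """Posterior probability that a code line is erroneous.

    Assumption (not from the paper): annotators vote independently, and
    annotator i is correct with probability p_i regardless of the truth.
    votes[i] is True when annotator i flagged the line as erroneous.
    """
    # Work in log-odds so each vote adds or subtracts its evidence.
    log_odds = math.log(prior / (1.0 - prior))
    for vote, p in zip(votes, reliabilities):
        evidence = math.log(p / (1.0 - p))
        log_odds += evidence if vote else -evidence
    return 1.0 / (1.0 + math.exp(-log_odds))


def update_reliability(correct, total, alpha=1.0, beta=1.0):
    """Beta-posterior mean of p_i after `correct` of `total` checked annotations.

    Assumption: a Beta(alpha, beta) prior on p_i; the defaults are uniform.
    """
    return (alpha + correct) / (alpha + beta + total)


# Three annotators: two reliable ones flag the line, a weaker one does not.
prob = posterior_error_probability([True, True, False], [0.9, 0.8, 0.6])
# An annotator judged accurate on 8 of 10 checked annotations.
p_new = update_reliability(correct=8, total=10)
print(round(prob, 2), round(p_new, 2))
```

In this sketch a disagreeing vote from a low-reliability annotator only slightly lowers the posterior, which is the intuition behind weighting crowd feedback by $p_i$ rather than taking a simple majority vote.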