Applying RLAIF for Code Generation with API-usage in Lightweight LLMs

Sujan Dutta, Sayantan Mahinder, Raviteja Anantha, Bortik Bandyopadhyay

TL;DR

This work tackles the problem of generating code that correctly uses external APIs with lightweight LLMs, where hallucinations are common. It introduces an RLAIF pipeline that uses AI feedback from a larger model to train a reward model and then fine-tunes a 780M-parameter model with PPO-based RL. On the Gorilla dataset, the approach improves executability rate, CodeBLEU, and AST scores, and the 780M-parameter model trained with RLAIF outperforms a much larger (7B-parameter) fine-tuned baseline on executability rate. This suggests that AI feedback can substitute for expensive human labels and enable practical API-aware code generation with smaller models.

Abstract

Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains, including mitigating harm in LLM outputs, enhancing text summarization, and improving mathematical reasoning. This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (<1B parameters) LLMs. We specifically focus on code generation tasks that require writing appropriate API calls, which is challenging due to the well-known issue of hallucination in LLMs. Our framework extracts AI feedback from a larger LLM (e.g., GPT-3.5) through a specialized prompting strategy and uses this data to train a reward model that better aligns smaller LLMs. We run our experiments on the Gorilla dataset and meticulously assess the quality of the model-generated code across various metrics, including AST, ROUGE, and Code-BLEU, and develop a pipeline to compute its executability rate accurately. Our approach significantly enhances the fine-tuned LLM baseline's performance, achieving a 4.5% improvement in executability rate. Notably, a smaller LLM (780M parameters) trained with RLAIF surpasses a much larger fine-tuned baseline with 7B parameters, achieving a 1.0% higher code executability rate.

Paper Structure (13 sections, 1 equation, 2 figures, 2 tables)

This paper contains 13 sections, 1 equation, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Schematic diagram of the proposed framework. Step 1 is to fine-tune a base model on the dataset. In step 2, we score the $\mathcal{M}_\textit{SFT}$-generated outputs based on the GPT-3.5 feedback using the technique described in the method section. Using this score, we prepare preference data and train a reward model. Finally, in step 3, we use RL to fine-tune $\mathcal{M}_\textit{SFT}$, where $\mathcal{M}_\textit{reward}$ provides the reward.
  • Figure 2: Example code generated by different models for the same instruction. In the generations of $\mathcal{M}_\textit{Gorilla}$ and $\mathcal{M}_\textit{SFT}$, the variable russian_text is undefined and hence will result in an error, whereas $\mathcal{M}_\textit{RL}$ defines the variable text before using it.
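
The two captions above can be made concrete with a minimal sketch. It is not the authors' implementation: the function names `build_preference_pairs` and `runs_without_error`, the `min_gap` threshold, and the shape of the scored data are all assumptions made for illustration. The first function shows one plausible way to turn per-output scores (as in step 2 of Figure 1) into chosen/rejected pairs for reward-model training; the second shows a crude in-process check that catches the undefined-variable failure described for Figure 2.

```python
from itertools import combinations


def build_preference_pairs(scored, min_gap=1.0):
    """Turn scored generations into (prompt, chosen, rejected) triples.

    `scored` maps each prompt to a list of (code, score) tuples; the scores
    are assumed to come from an AI-feedback step. `min_gap` is a hypothetical
    threshold that skips pairs whose scores are too close to rank reliably.
    """
    pairs = []
    for prompt, candidates in scored.items():
        for (code_a, score_a), (code_b, score_b) in combinations(candidates, 2):
            if abs(score_a - score_b) < min_gap:
                continue  # near-tie: no clear preference, so skip the pair
            if score_a > score_b:
                pairs.append((prompt, code_a, code_b))
            else:
                pairs.append((prompt, code_b, code_a))
    return pairs


def runs_without_error(code):
    """Crude executability check: run the snippet in a fresh namespace.

    Any exception (including NameError from an undefined variable) counts
    as a failure. Assumption: the snippet is trusted or already sandboxed.
    """
    try:
        exec(compile(code, "<generated>", "exec"), {})
    except Exception:
        return False
    return True


# Mirrors the Figure 2 contrast: one snippet uses a variable it never defines.
uses_undefined = "n = len(russian_text)"
defines_first = "text = 'privet'\nn = len(text)"
```

Calling `runs_without_error(uses_undefined)` fails on the `NameError`, while `runs_without_error(defines_first)` succeeds. A real executability pipeline would need isolation (separate processes, timeouts, installed API dependencies), since `exec` on model output is unsafe outside a sandbox.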