Iterative Self-Training for Code Generation via Reinforced Re-Ranking

Nikita Sorokin, Ivan Sedykh, Valentin Malykh

TL;DR

This work introduces RewardRanker, a reranker-and-generator pair trained via an iterative reinforcement learning loop using Proximal Policy Optimization to refine a reward model and improve code generation across multiple languages. The training pipeline combines supervised fine-tuning, a Bradley-Terry-inspired reward objective, and PPO-driven candidate generation with an iterative self-training cycle that incorporates hard negatives. On the MultiPL-E benchmark, a $13.4B$-parameter RewardRanker setup (with $6.7B$-parameter components) achieves strong results, surpassing a $33B$ model and approaching GPT-4, with a notable advantage in C++. MBPP experiments further show superiority over LEVER baselines, highlighting the method’s robustness and efficiency. The approach offers a resource-efficient path to high-quality multilingual code generation through iteratively refined reranking.
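The summary describes the reward objective only as "Bradley-Terry-inspired". As a rough illustration, the standard pairwise Bradley-Terry loss is $-\log \sigma(s_{+} - s_{-})$, where $s_{+}$ and $s_{-}$ are the reward model's scores for a preferred and a rejected solution. The sketch below assumes that standard form; the function name and the scalar-score interface are hypothetical and may differ from the authors' implementation.

```python
import math


def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(score_preferred - score_rejected)).

    Assumption: this is the textbook Bradley-Terry form, not necessarily
    the exact objective used to train RewardRanker.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so that exp() never receives a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the preferred solution's score pulls ahead of the rejected one, and equals $\log 2$ when the two scores are tied.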

Abstract

Generating high-quality code that solves complex programming tasks is challenging, especially with current decoder-based models that produce highly stochastic outputs. In code generation, even minor errors can easily break the entire solution. Leveraging multiple sampled solutions can significantly improve the overall output quality. One effective way to enhance code generation is by pairing a code generation model with a reranker model, which selects the best solution from the generated samples. We propose a novel iterative self-training approach for reranker models using Proximal Policy Optimization (PPO), aimed at improving both reranking accuracy and the overall code generation process. Unlike traditional PPO approaches, where the focus is on optimizing a generative model with a reward model, our approach emphasizes the development of a robust reward/reranking model. This model improves the quality of generated code through reranking and addresses problems and errors that the reward model might overlook during PPO alignment with the reranker. Our method iteratively refines the training dataset by re-evaluating outputs, identifying high-scoring negative examples, and incorporating them into the training loop, thereby boosting model performance. Our evaluation on the MultiPL-E dataset demonstrates that our 13.4B parameter model outperforms a 33B model in code generation quality while being three times faster. Moreover, it achieves performance comparable to GPT-4 and surpasses it in one programming language.
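The generator-plus-reranker pairing described in the abstract can be pictured as best-of-n selection: sample several candidate solutions, score each with the reranker, and return the highest-scoring one. The following is a minimal sketch under that reading; `rerank_best_of_n`, `score_fn`, and the toy scorer are hypothetical names introduced for illustration, not part of the paper.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    prompt: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate to which the reranker assigns the highest score.

    `score_fn(prompt, candidate)` stands in for the trained reranker;
    its interface is an assumption made for this sketch.
    """
    if not candidates:
        raise ValueError("at least one candidate solution is required")
    return max(candidates, key=lambda candidate: score_fn(prompt, candidate))


def toy_score(prompt: str, candidate: str) -> float:
    """Placeholder scorer: rewards candidates that contain a return statement."""
    return 1.0 if "return" in candidate else 0.0


best = rerank_best_of_n(
    "Write add(a, b).",
    ["def add(a, b): pass", "def add(a, b): return a + b"],
    toy_score,
)
```

In practice the scorer would be the trained reward model, so the quality of the final output depends directly on how well that model separates correct from incorrect code.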

Paper Structure

This paper contains 7 sections, 1 figure, 2 tables.

Figures (1)

  • Figure 1: Iterative Self-Training Workflow for RewardRanker. The process starts with supervised fine-tuning (A), followed by training the RewardRanker model (B). A PPO-based model (C) is then trained, generating new examples that are evaluated to produce both positive and hard negative samples (D). These samples are fed back into the process for further refinement and retraining (E), completing the iterative loop.
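Step (D) of the workflow, where newly generated examples are evaluated and sorted into positives and hard negatives, might look roughly like the sketch below. It assumes that each candidate carries a reranker score and a pass/fail evaluation result, and that a "hard negative" is a failing candidate the reranker nevertheless scored at or above some threshold. The `Candidate` type, the `passes_evaluation` field, and `score_threshold` are all hypothetical; the source does not specify how the cutoff is chosen.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    prompt: str
    code: str
    reranker_score: float    # score from the current reranker (assumed scalar)
    passes_evaluation: bool  # outcome of re-evaluating the generated code


def split_positives_and_hard_negatives(
    candidates: Sequence[Candidate],
    score_threshold: float,
) -> Tuple[List[Candidate], List[Candidate]]:
    """Separate passing candidates from failing ones the reranker rated highly.

    Assumption: a fixed score threshold defines "high-scoring"; the paper
    may use a different selection rule.
    """
    positives = [c for c in candidates if c.passes_evaluation]
    hard_negatives = [
        c
        for c in candidates
        if not c.passes_evaluation and c.reranker_score >= score_threshold
    ]
    return positives, hard_negatives


pool = [
    Candidate("task", "correct solution", 0.9, True),
    Candidate("task", "plausible but wrong", 0.8, False),
    Candidate("task", "obviously wrong", 0.1, False),
]
positives, hard_negatives = split_positives_and_hard_negatives(pool, score_threshold=0.5)
```

In step (E), the two returned lists would be appended to the training set before the reranker is retrained, so that each round targets the mistakes the previous reranker failed to penalize.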