ACECode: A Reinforcement Learning Framework for Aligning Code Efficiency and Correctness in Code Language Models

Chengran Yang, Hong Jin Kang, Jieke Shi, David Lo

TL;DR

This work targets the problem that CodeLLMs often produce functionally correct but runtime-inefficient code. It introduces ACECode, a reinforcement learning fine-tuning framework that uses a training-free rewarder derived from code execution to jointly optimize code efficiency $G_e$ and correctness $G_c$ via Proximal Policy Optimization. ACECode demonstrates significant gains in code correctness (pass@1) and efficiency (ECC and GET) across four state-of-the-art open-source CodeLLMs, outperforming original, instruction-tuned, and PIE baselines, with improvements of up to $14.51\%$ in pass@1 and up to $10.86\%$ in GET, while reducing runtime in $65\%-72\%$ of cases. The framework removes the need for manually labeled data and execution environments during inference, offering a robust, environment-agnostic approach to multi-objective code generation that could enhance real-world software performance and sustainability.

Abstract

CodeLLMs have demonstrated remarkable advancements in software engineering tasks. However, while these models can generate functionally correct code, they often produce code that is inefficient in terms of runtime. This inefficiency is particularly problematic in resource-constrained environments, impacting software performance and sustainability. Existing approaches for optimizing the efficiency of CodeLLM-generated code, such as SOAP and PIE, have limitations. SOAP requires a compatible execution environment and predefined test cases for iterative code modification, while PIE focuses on instruction tuning, improving efficiency but compromising correctness. These shortcomings highlight the need for a fine-tuning framework that optimizes both efficiency and correctness without relying on predefined test cases or specific execution environments. To bridge this gap, we introduce ACECode, a reinforcement learning-based fine-tuning framework that aligns CodeLLMs with the dual objectives of efficiency and correctness. ACECode combines three key steps: (1) generating code with an actor CodeLLM, (2) calculating a training-free reward signal derived from code execution feedback for each generated code snippet, and (3) optimizing the CodeLLM via the Proximal Policy Optimization (PPO) algorithm. This reward signal enables joint assessment of efficiency and correctness without manual labeling. We evaluate ACECode by fine-tuning four SOTA (state-of-the-art) CodeLLMs and comparing their code with three baselines: original, instruction-tuned, and PIE-tuned CodeLLMs. Extensive experimental results suggest that ACECode significantly improves the efficiency and correctness of generated code against all baselines for all CodeLLMs. Specifically, CodeLLMs fine-tuned with ACECode improve pass@1 by 1.84% to 14.51% and reduce runtime in 65% to 72% of cases compared to original CodeLLMs.
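
Step (2) of the abstract turns execution feedback into a single scalar reward. The Python sketch below shows one hypothetical way to do this and is not the reward defined in the paper: the function names `run_tests` and `reward`, the partial-credit scheme for failing code, and the speedup bonus capped at 1 are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Sequence, Tuple

TestCase = Tuple[tuple, Any]  # (positional arguments, expected return value)


def run_tests(func: Callable[..., Any], tests: Sequence[TestCase]) -> Tuple[int, float]:
    """Run func on every test case; return (tests passed, wall-clock seconds)."""
    passed = 0
    start = time.perf_counter()
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            # A crash counts as a failed test, not as a fatal error.
            pass
    return passed, time.perf_counter() - start


def reward(passed: int, total: int, runtime: float, ref_runtime: float) -> float:
    """Hypothetical reward shape (an assumption, not the paper's formula).

    Incorrect code gets a value in [-1, 0) that grows with the pass ratio.
    Fully correct code gets 1 plus a bonus in [0, 1] for beating the
    reference runtime.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if passed < total:
        return passed / total - 1.0
    speedup = ref_runtime / max(runtime, 1e-9)  # guard against zero runtime
    bonus = min(max(speedup - 1.0, 0.0), 1.0)
    return 1.0 + bonus
```

Keeping `reward` a pure function of the measured quantities separates the noisy part (timing) from the scoring rule, so the rule itself can be checked deterministically. In a PPO loop, this scalar would be the reward assigned to each sampled code snippet.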

Paper Structure

This paper contains 32 sections, 11 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of RLHF. Generally, the process begins with the LLM generating multiple responses for a given prompt, which are then ranked by human labelers based on their preference. These ranked responses are used to train a reward model that assigns a reward score to evaluate the human preference of newly generated responses. The LLM is then fine-tuned using the Proximal Policy Optimization (PPO) algorithm to optimize its outputs for higher reward scores, iteratively aligning LLM's outputs with human preferences.
  • Figure 2: Overview of ACECode with PPO training. The Actor LLM, which serves as the target model for optimization, takes prompts as input and generates code snippets at each step. For each LLM output, the Unit-test Based Rewarder executes the generated code with all test cases, assigning a reward score to the generated code, representing both code correctness and efficiency based on execution feedback. This reward score is used to guide the Actor LLM in optimizing its policy via the PPO algorithm. Meanwhile, the Critic LLM estimates the value function for each Actor LLM output, facilitating more stable policy optimization.
  • Figure 3: Impact of Training Epoch Numbers on ACECode Performance. Pass@1 and GET relative improvements refer to the percentage of improvement in terms of Pass@1 and GET scores compared to the original CodeLLMs.
  • Figure 4: Impact of Number of Responses on ACECode Performance. Pass@1 and GET relative improvements refer to the percentage of improvement in terms of Pass@1 and GET scores compared to the original CodeLLMs.
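
The "relative improvement" used in Figures 3 and 4 is a percentage change against the original CodeLLM's score. A minimal Python sketch of that calculation follows; the function name `relative_improvement` and the example scores are hypothetical and not taken from the paper.

```python
def relative_improvement(tuned: float, original: float) -> float:
    """Percentage change of the tuned model's score relative to the original's."""
    if original == 0:
        raise ValueError("original score must be non-zero")
    return 100.0 * (tuned - original) / original


# Hypothetical scores: a metric that rises from 50.0 to 55.0 is a 10% improvement.
print(relative_improvement(55.0, 50.0))
```

A positive value means the fine-tuned model scores higher than the original on that metric; a negative value means it scores lower.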