SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning

Yangruibo Ding, Jinjun Peng, Marcus J. Min, Gail Kaiser, Junfeng Yang, Baishakhi Ray

TL;DR

The paper targets a key limitation of code LLMs: shallow understanding of program semantics. It introduces monologue reasoning and a large semantic-training pipeline using the PyX dataset and the PyX-R debugging corpus to teach models to reason about high-level goals, per-line effects, and final I/O behavior. SemCoder, a 6.7B model, demonstrates competitive or superior performance on code generation and execution reasoning benchmarks, outperforming GPT-3.5-turbo and many open-source rivals. The work shows that monologue-style, execution-aware training improves debugging and self-refinement, suggesting substantial practical impact for building more reliable programming assistants. Public release of data, code, and model checkpoints further enables adoption and benchmarking in the developer and research communities.

Abstract

Code Large Language Models (Code LLMs) have excelled at tasks like code completion but often miss deeper semantics such as execution effects and dynamic states. This paper aims to bridge the gap between Code LLMs' reliance on static text data and the need for semantic understanding for complex tasks like debugging and program repair. We introduce a novel strategy, monologue reasoning, to train Code LLMs to reason about comprehensive semantics, encompassing high-level functional descriptions, local execution effects of individual statements, and overall input/output behavior, thereby linking static code text with dynamic execution states. We begin by collecting PyX, a clean Python corpus of fully executable code samples with functional descriptions and test cases. We propose training Code LLMs not only to write code but also to understand code semantics by reasoning about key properties, constraints, and execution behaviors using natural language, mimicking human verbal debugging, i.e., rubber-duck debugging. This approach led to the development of SemCoder, a Code LLM with only 6.7B parameters, which shows competitive performance with GPT-3.5-turbo on code generation and execution reasoning tasks. SemCoder achieves 79.3% on HumanEval (GPT-3.5-turbo: 76.8%), 63.6% on CRUXEval-I (GPT-3.5-turbo: 50.3%), and 63.9% on CRUXEval-O (GPT-3.5-turbo: 59.0%). We also study the effectiveness of SemCoder's monologue-style execution reasoning compared to concrete scratchpad reasoning, showing that our approach integrates semantics from multiple dimensions more smoothly. Finally, we demonstrate the potential of applying learned semantics to improve Code LLMs' debugging and self-refining capabilities. Our data, code, and models are available at: https://github.com/ARiSE-Lab/SemCoder.
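
To make "local execution effects of individual statements" concrete, the sketch below pairs a small function with a test case and records the local variables seen at each executed line. It is an illustration under stated assumptions, not the authors' data pipeline: `sort_energy_indices` is a hypothetical function, written only to be consistent with the example input/output quoted in the Figure 1 caption, and `trace_locals` is a hypothetical helper built on Python's standard `sys.settrace`. The result is a raw, concrete execution record; a monologue would describe the same effects in natural language.

```python
import sys


def sort_energy_indices(energies):
    """Hypothetical example: index of the first occurrence of each distinct
    energy, ordered by ascending energy."""
    first_seen = {}
    for index, energy in enumerate(energies):
        if energy not in first_seen:
            first_seen[energy] = index
    ordered = sorted(first_seen)
    return [first_seen[energy] for energy in ordered]


def trace_locals(func, *args):
    """Run func(*args) and record, for every line event inside func,
    the line offset and a frozen (repr) snapshot of the local variables.

    A line event fires just before the line runs, so each snapshot shows
    the state produced by the statements executed so far.
    """
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Ignore frames that do not belong to the traced function.
        if frame.f_code is not code:
            return None
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            # repr() freezes mutable values such as the dict being filled.
            snapshot = {name: repr(value) for name, value in frame.f_locals.items()}
            steps.append((offset, snapshot))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, steps


if __name__ == "__main__":
    output, trace = trace_locals(sort_energy_indices, [10.5, 8.2, 10.5, 7.1, 8.2])
    for offset, state in trace:
        print(f"line +{offset}: {state}")
    print("output:", output)
```

Each recorded step is the kind of per-statement state change that a forward explanation has to account for, and the final output closes the loop with the test case.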

Paper Structure (75 sections, 6 figures, 8 tables)

Figures (6)

  • Figure 1: SemCoder's training strategy with different modalities of program semantics. We specify the overall objective of a task, i.e., the approximate semantics (blue box), such as "retrieves potential energies of atoms and performs sorting", followed by the corresponding code solution (pink box). Then we annotate the abstract code semantics as the key properties and constraints (red box) that hold regardless of inputs. Beyond static semantics, we also pair code with test cases, such as "Given [10.5, 8.2, 10.5, 7.1, 8.2], return [3, 1, 0]". We further annotate the dynamic, operational semantics with forward and backward monologues (yellow box; more in the paper's section on monologue reasoning). SemCoder learns from all of this information not only to generate code but also to reason comprehensively about its semantics.
  • Figure 2: Forward monologue simulates the execution step-by-step, and backward monologue deduces the previous program states by making assumptions and checking with observed constraints.
  • Figure 3: SemCoder-S's zero-shot performance of self-refinement at each time step with different sampling strategies.
  • Figure 4: PyX: Execution-aware Training Data Collection Strategy
  • Figure 5: Edit similarities between PyX and two popular benchmarks
  • ...and 1 more figure
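
The backward direction described in the Figure 2 caption (make an assumption about an earlier state, then check it against observed constraints) can be sketched as a small search. This is a minimal illustration, not SemCoder's method: the model reasons in natural language, whereas this sketch enumerates candidates. `sort_energy_indices` is the same hypothetical function consistent with the Figure 1 test case, and `consistent_with_output` and `backward_search` are hypothetical helpers.

```python
from itertools import product


def sort_energy_indices(energies):
    """Hypothetical example: index of the first occurrence of each distinct
    energy, ordered by ascending energy."""
    first_seen = {}
    for index, energy in enumerate(energies):
        if energy not in first_seen:
            first_seen[energy] = index
    ordered = sorted(first_seen)
    return [first_seen[energy] for energy in ordered]


def consistent_with_output(candidate, output):
    """Constraints that can be read off the output without running the code."""
    # Every output entry is an index into the input, so the input is long enough.
    if output and len(candidate) <= max(output):
        return False
    # The output holds exactly one index per distinct input value.
    if len(set(candidate)) != len(output):
        return False
    # Indices are listed in strictly ascending order of the values they point to.
    values = [candidate[i] for i in output]
    return all(a < b for a, b in zip(values, values[1:]))


def backward_search(output, domain, length):
    """Yield inputs of the given length, drawn from domain, that produce output."""
    for candidate in product(domain, repeat=length):
        if not consistent_with_output(candidate, output):
            continue  # assumption rejected by the observed constraints
        # Surviving assumptions are confirmed by forward execution.
        if sort_energy_indices(list(candidate)) == output:
            yield list(candidate)


if __name__ == "__main__":
    for found in backward_search([3, 1, 0], (7.1, 8.2, 10.5), 4):
        print(found)
```

With the domain and length used here, more than one input survives the checks, which shows why backward reasoning works with constraints on the input rather than a single reconstructed value.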