CWM: An Open-Weights LLM for Research on Code Generation with World Models

FAIR CodeGen team, Jade Copet, Quentin Carbonneaux, Gal Cohen, Jonas Gehring, Jacob Kahn, Jannik Kossen, Felix Kreuk, Emily McMilin, Michel Meyer, Yuxiang Wei, David Zhang, Kunhao Zheng, Jordi Armengol-Estapé, Pedram Bashiri, Maximilian Beck, Pierre Chambon, Abhishek Charnalia, Chris Cummins, Juliette Decugis, Zacharias V. Fisches, François Fleuret, Fabian Gloeckle, Alex Gu, Michael Hassid, Daniel Haziza, Badr Youbi Idrissi, Christian Keller, Rahul Kindi, Hugh Leather, Gallil Maimon, Aram Markosyan, Francisco Massa, Pierre-Emmanuel Mazaré, Vegard Mella, Naila Murray, Keyur Muzumdar, Peter O'Hearn, Matteo Pagliardini, Dmitrii Pedchenko, Tal Remez, Volker Seeker, Marco Selvi, Oren Sultan, Sida Wang, Luca Wehrstedt, Ori Yoran, Lingming Zhang, Taco Cohen, Yossi Adi, Gabriel Synnaeve

TL;DR

Code World Model (CWM) tackles the gap between static code data and executable dynamics by mid-training on Python execution traces and agentic Docker trajectories, followed by supervised fine-tuning and multi-task reinforcement learning across coding, math, and software-engineering domains. The 32B dense, decoder-only Transformer, with a context of up to 131k tokens, supports step-by-step Python execution simulation and trace-grounded reasoning, achieving competitive results on SWE-bench Verified and related benchmarks. By releasing final and intermediate checkpoints, the work enables open research into world-model-guided code generation, reasoning, and planning within computational environments, while addressing transparency and risk considerations. The results indicate that grounding code generation in execution dynamics and environment interactions can improve correctness, debugging, and long-context reasoning, and point toward future directions in neural debugging and grounded chain-of-thought.

Abstract

We release Code World Model (CWM), a 32-billion-parameter open-weights LLM, to advance research on code generation with world models. To improve code understanding beyond what can be learned from training on static code alone, we mid-train CWM on a large amount of observation-action trajectories from Python interpreter and agentic Docker environments, and perform extensive multi-task reasoning RL in verifiable coding, math, and multi-turn software engineering environments. With CWM, we provide a strong testbed for researchers to explore the opportunities world modeling affords for improving code generation with reasoning and planning in computational environments. We present first steps of how world models can benefit agentic coding, enable step-by-step simulation of Python code execution, and show early results of how reasoning can benefit from the latter. CWM is a dense, decoder-only LLM trained with a context size of up to 131k tokens. Independent of its world modeling capabilities, CWM offers strong performance on general coding and math tasks: it reaches pass@1 scores of 65.8% on SWE-bench Verified (with test-time scaling), 68.6% on LiveCodeBench, 96.6% on Math-500, and 76.0% on AIME 2024. To support further research on code world modeling, we release model checkpoints after mid-training, SFT, and RL.

Paper Structure

This paper contains 48 sections, 5 equations, 31 figures, 14 tables.

Figures (31)

  • Figure 1: Overview of the CWM training stages and the model checkpoints that we release. We generally report performance of the final CWM (instruct, RL trained) model, except where otherwise stated.
  • Figure 2: On SWE-bench Verified, CWM outperforms open-weight models with similar parameter counts and is even competitive with much larger or closed-weight LLMs. The base score for CWM is computed with a single attempt per instance (no retries, majority voting, or parallel candidates), averaged over multiple runs to reduce variance. For "Test Time Scaling", we generate multiple candidates in parallel and then submit one patch based on ranking. For GPT-oss models, the "Test Time Scaling" score corresponds to the high reasoning budget and the lower score to the low reasoning budget. (*: GPT-5 and GPT-oss use a custom subset of 477 problems, while CWM is evaluated on the full set of 500 problems.)
  • Figure 3: CWM format for Python traces. Given a source code context and a marker of the trace starting point, CWM predicts a series of stack frames representing the program states and the actions (executed code).
  • Figure 4: Example of CWM solving a competitive programming problem in agentic fashion. The reasoning steps are shortened and some details are omitted due to space constraints. (Tool calls in purple, environment feedback in orange, and reasoning in blue.)
  • Figure 5: Example with execution trace prediction and reasoning. In this example, we add an execution trace example to the prompt. After reasoning in natural language about the code, the model uses its execution trace prediction capability to confirm the correct return value {1: 2, 2: 4}. We encode all special tokens (e.g. <|frame_sep|>) as such.
  • ...and 26 more figures
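
To make the idea of a Python execution trace (Figure 3) concrete, here is a minimal sketch of how observation-action pairs might be recorded with the standard `sys.settrace` hook: each event stores the position of the executed line (the action) and a snapshot of the local variables (the observed program state). This is an illustration only and is not the CWM data pipeline or trace format; the names `record_trace` and `running_sum` and the dictionary layout of each event are assumptions made for this example.

```python
import copy
import sys
from typing import Any, Callable


def record_trace(func: Callable[..., Any], *args: Any) -> tuple[Any, list[dict]]:
    """Run func(*args) and record call/line/return events inside func only."""
    events: list[dict] = []
    target = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not target:
            return None  # skip frames of any other function
        if event in ("call", "line", "return"):
            events.append({
                "event": event,
                # line offset relative to the `def` line of the traced function
                "line": frame.f_lineno - target.co_firstlineno,
                # deep copy so later mutations do not alter the snapshot
                "locals": copy.deepcopy(dict(frame.f_locals)),
                "return": arg if event == "return" else None,
            })
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, events


def running_sum(xs):
    # Hypothetical toy function used only to demonstrate tracing.
    total = 0
    for x in xs:
        total += x
    return total


if __name__ == "__main__":
    result, events = record_trace(running_sum, [1, 2, 3])
    for e in events:
        print(e["event"], e["line"], e["locals"])
```

A model trained on traces of this kind is asked to predict the sequence of states from the source code alone, rather than obtaining it by running the interpreter as this sketch does.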