Seed-Coder: Let the Code Model Curate Data for Itself

ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu

TL;DR

Seed-Coder introduces a model-centric data pipeline for pretraining open-source code LLMs at the 8B scale with minimal human curation, achieving state-of-the-art results among open-source models of similar size. It combines a multi-source data strategy (GitHub code, commits, and code-related web data) with two forms of post-training: instruction fine-tuning with direct preference optimization for the instruct model, and LongCoT reinforcement learning for the reasoning model. The approach emphasizes data quality through LLM-based filtering, sandbox self-correction, and Fill-in-the-Middle training, along with a decontamination step to preserve benchmark integrity. Empirical results show strong performance across code generation, completion, editing, reasoning, and software engineering tasks, indicating that open-source, self-curated data pipelines for code intelligence are practical and competitive.

Abstract

Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often rely heavily on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct, and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.
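
As a rough illustration of the "score, then filter" idea described in the abstract, the sketch below keeps only files whose quality score clears a threshold. It is a minimal sketch and not the authors' pipeline: `CodeFile`, `filter_by_score`, `toy_scorer`, the 0 to 10 score range, and the threshold value are all hypothetical, and `toy_scorer` is a simple heuristic standing in for an LLM-based scorer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CodeFile:
    """A source file candidate for the pretraining corpus (hypothetical type)."""
    path: str
    text: str


def filter_by_score(
    files: Iterable[CodeFile],
    scorer: Callable[[str], float],
    threshold: float,
) -> List[CodeFile]:
    """Keep files whose score is at least `threshold`.

    `scorer` is any callable mapping file text to a number; in a
    model-centric pipeline it would wrap a model call (assumption).
    """
    return [f for f in files if scorer(f.text) >= threshold]


def toy_scorer(text: str) -> float:
    """Stand-in for an LLM-based quality scorer (hypothetical heuristic).

    Rewards files that define a function and carry comments, on an
    assumed 0 to 10 scale. A real scorer would be a model, not a rule.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    commented = sum(1 for ln in lines if ln.lstrip().startswith("#"))
    has_def = any(ln.lstrip().startswith("def ") for ln in lines)
    return min(10.0, 5.0 * has_def + 10.0 * commented / len(lines))


if __name__ == "__main__":
    corpus = [
        CodeFile("a.py", "# add two numbers\ndef add(a, b):\n    return a + b\n"),
        CodeFile("b.py", "x=1\ny=2\n"),
    ]
    kept = filter_by_score(corpus, toy_scorer, threshold=5.0)
    print([f.path for f in kept])
```

Passing the scorer as a callable keeps the filtering step independent of how scores are produced, so the heuristic here could be swapped for a model-backed scorer without changing `filter_by_score`.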

Paper Structure

This paper contains 42 sections, 2 equations, 13 figures, 17 tables.

Figures (13)

  • Figure 1: Benchmark performance of instruct and reasoning variants of Seed-Coder-8B.
  • Figure 2: Processing pipeline for pretraining data. We collected data from GitHub and web archives. The raw data were processed into four categories: file-level codes (yellow), repository-level codes (blue), GitHub commits (red) and code-related web data (green). For each phase in pretraining, we combined and reorganized the processed data from the four categories, indicated by colors on the top-right of the blocks.
  • Figure 3: Sample Python script with decent structure but logical errors.
  • Figure 4: Sample code snippet from a Python script for controlling an LED display. The original code (left) and a visually intuitive format obtained by replacing zeros and commas with spaces (right) are presented.
  • Figure 5: Pipeline of our data quality scorer.
  • ...and 8 more figures