CodeShell Technical Report

Rui Xie, Zhengran Zeng, Zhuohao Yu, Chang Gao, Shikun Zhang, Wei Ye

TL;DR

CodeShell addresses the data quality bottleneck in code LLMs by coupling a GPT-2–based architecture, extended with RoPE and Grouped-Query Attention, with a rigorous data pipeline that yields 100B tokens of curated code. It argues that high-quality data, rather than sheer scale, drives strong cross-language code understanding and generation, and it achieves competitive results with only 7B parameters and an 8K context. Key contributions include the data filtering framework, Chinese vocabulary expansion, and evaluations on HumanEval, MBPP, MultiPL-E, and code completion that highlight substantial gains from curated data. The work suggests practical implications for scalable, efficient code modeling where data governance and context length are pivotal.

Abstract

Code large language models mark a pivotal breakthrough in artificial intelligence. They are specifically crafted to understand and generate programming languages, significantly boosting the efficiency of coding development workflows. In this technical report, we present CodeShell-Base, a seven-billion-parameter foundation model with 8K context length, showcasing exceptional proficiency in code comprehension. By incorporating Grouped-Query Attention and Rotary Positional Embedding into GPT-2, CodeShell-Base integrates the structural merits of StarCoder and CodeLlama and forms its unique architectural design. We then carefully built a comprehensive data pre-processing pipeline, including deduplication of similar data, perplexity-based data filtering, and model-based data filtering. Through this pipeline, we have curated 100 billion tokens of high-quality pre-training data from GitHub. Benefiting from the high-quality data, CodeShell-Base outperforms CodeLlama on HumanEval after training on just 500 billion tokens (5 epochs). We have conducted extensive experiments across multiple language datasets, including Python, Java, and C++, and the results indicate that our model possesses robust foundational capabilities in code comprehension and generation.
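
To make the three pre-processing stages named in the abstract concrete, the following is a minimal sketch of how deduplication, perplexity filtering, and model-based filtering could be chained. It is not the authors' implementation: the Jaccard-over-shingles similarity, the smoothed unigram model standing in for a real language model, the `quality_fn` scorer, and every threshold are assumptions chosen only for illustration.

```python
import math
from collections import Counter


def shingles(text, n=3):
    """Return the set of n-token shingles of a whitespace-tokenised text."""
    tokens = text.split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not too similar to one already kept.

    The 0.8 threshold is a hypothetical value, not taken from the report.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def unigram_perplexity(text, counts, total):
    """Perplexity under an add-one-smoothed unigram model.

    This is a toy stand-in for the language model a real pipeline would use.
    """
    tokens = text.split()
    if not tokens:
        return float("inf")
    vocab = len(counts) + 1  # one extra slot for unseen tokens
    log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
    return math.exp(-log_prob / len(tokens))


def filter_corpus(docs, reference, max_ppl, quality_fn, min_quality):
    """Apply the three stages in order: dedup, perplexity filter, quality filter.

    `quality_fn` is a hypothetical scorer standing in for a trained classifier.
    """
    counts = Counter(reference.split())
    total = sum(counts.values())
    out = []
    for doc in deduplicate(docs):
        if unigram_perplexity(doc, counts, total) > max_ppl:
            continue  # too unlike the reference text
        if quality_fn(doc) < min_quality:
            continue  # rejected by the quality scorer
        out.append(doc)
    return out
```

In this sketch the cheap similarity check runs first so that the more expensive scoring stages only see documents that survive deduplication; whether the report orders its stages the same way is not stated in the abstract.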

Paper Structure (16 sections, 3 figures, 6 tables)

This paper contains 16 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Prompt of code quality annotation.
  • Figure 2: Training loss over train tokens.
  • Figure 3: The effectiveness of high-quality code data filtering mechanism.