aiXcoder-7B: A Lightweight and Effective Large Language Model for Code Processing

Siyuan Jiang, Jia Li, He Zong, Huanyu Liu, Hao Zhu, Shukai Hu, Erlu Li, Jiazheng Ding, Yu Han, Wei Ning, Gen Wang, Yihong Dong, Kechi Zhang, Ge Li

TL;DR

aiXcoder-7B tackles the trade-off between code completion accuracy and inference efficiency with a lightweight 7B-parameter LLM. It is trained with multiple objectives, including Structured Fill-In-the-Middle (SFIM), with data sampling strategies that account for inter-file relationships, and on a large-scale pre-training corpus totaling 1.2 trillion tokens. The approach combines explicit awareness of code structure with cross-file context and rigorous data curation, and it achieves strong performance across NL2Code, FIM, and cross-file benchmarks, outperforming similar-sized models and several larger LLMs. The work also provides practical training insights and openly releases the model and code, indicating that careful objective design and data strategies can yield code LLMs that are both efficient and effective for academia and industry.

Abstract

Large Language Models (LLMs) have been widely used in code completion, and researchers are focusing on scaling up LLMs to improve their accuracy. However, larger LLMs have lower inference efficiency, which hurts developers' experience and productivity. In this paper, we propose a lightweight and effective LLM for code completion named aiXcoder-7B. Compared to existing LLMs, aiXcoder-7B achieves higher code completion accuracy at a smaller scale (i.e., 7 billion parameters). We attribute the superiority of aiXcoder-7B to three key factors: (1) Multi-objective training. We employ three training objectives, one of which is our proposed Structured Fill-In-the-Middle (SFIM). SFIM considers the syntax structures in code and effectively improves the performance of LLMs for code. (2) Diverse data sampling strategies. They consider inter-file relationships and enhance the capability of LLMs in understanding cross-file contexts. (3) Extensive high-quality data. We establish a rigorous data collection pipeline and consume a total of 1.2 trillion unique tokens for training aiXcoder-7B. This vast volume of data enables aiXcoder-7B to learn a broad distribution of code. We evaluate aiXcoder-7B on five popular code completion benchmarks and a new benchmark collected in this paper. The results show that aiXcoder-7B outperforms six recent LLMs of similar size and even surpasses four larger LLMs (e.g., StarCoder2-15B and CodeLlama-34B), positioning aiXcoder-7B as a lightweight and effective LLM for academia and industry. Finally, we summarize three valuable insights to help practitioners train the next generation of LLMs for code. aiXcoder-7B has been open-sourced and has gained significant attention. As of January 2025, aiXcoder-7B had received 2,226 GitHub stars.
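
To make the contrast between plain FIM and a structure-aware variant concrete, the following is a minimal sketch, not the authors' implementation. It assumes Python source restricted to ASCII with "\n" line endings, uses the standard `ast` module to find node boundaries, and uses placeholder sentinel strings (`<PRE>`, `<SUF>`, `<MID>`) that are hypothetical and not aiXcoder-7B's actual special tokens. The function names `random_span`, `structured_span`, and `to_fim_sample` are likewise illustrative.

```python
import ast
import random

# Hypothetical sentinel strings; a real model defines its own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def random_span(code: str, rng: random.Random) -> tuple[int, int]:
    """Plain FIM: choose an arbitrary character span, ignoring syntax."""
    start = rng.randrange(len(code))
    end = rng.randrange(start + 1, len(code) + 1)
    return start, end


def structured_span(code: str, rng: random.Random) -> tuple[int, int]:
    """Structure-aware variant: choose a span that covers one complete
    statement or expression node of the parsed syntax tree.

    Assumes ASCII source, so the byte offsets reported by `ast`
    coincide with character offsets.
    """
    # Character offset at which each source line starts.
    line_starts = [0]
    for line in code.split("\n"):
        line_starts.append(line_starts[-1] + len(line) + 1)

    nodes = [
        n for n in ast.walk(ast.parse(code))
        if isinstance(n, (ast.stmt, ast.expr))
    ]
    node = rng.choice(nodes)
    start = line_starts[node.lineno - 1] + node.col_offset
    end = line_starts[node.end_lineno - 1] + node.end_col_offset
    return start, end


def to_fim_sample(code: str, start: int, end: int) -> str:
    """Rearrange code into prefix, suffix, middle order for training."""
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


if __name__ == "__main__":
    source = "def add(a, b):\n    total = a + b\n    return total\n"
    rng = random.Random(0)
    for picker in (random_span, structured_span):
        s, e = picker(source, rng)
        print(picker.__name__, repr(source[s:e]))
```

In this sketch the random picker may cut through the middle of an identifier or statement, while the structure-aware picker always masks a whole syntactic unit such as `a + b` or `return total`, which is the kind of span a developer typically asks a completion model to write.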

Paper Structure

This paper contains 28 sections, 4 equations, 4 figures, 7 tables, and 1 algorithm.

Figures (4)

  • Figure 1: An overview of our data collection pipeline.
  • Figure 2: The distributions of the top 10 programming languages in our source code training data.
  • Figure 3: Examples of selected spans in FIM and SFIM.
  • Figure 4: Performance of LLMs on different types of code in FIM-Eval.