Crystal: Illuminating LLM Abilities on Language and Code

Tianhua Tao, Junbo Li, Bowen Tan, Hongyi Wang, William Marshall, Bhargav M Kanakiya, Joel Hestness, Natalia Vassilieva, Zhiqiang Shen, Eric P. Xing, Zhengzhong Liu

TL;DR

This work proposes a two-phase pretraining strategy, with appropriately adjusted code/language ratios, to enhance the integration of natural language and coding capabilities within a single LLM. The resulting model, Crystal, demonstrates remarkable capabilities in both domains.

Abstract

Large Language Models (LLMs) specializing in code generation (often referred to as code LLMs), e.g., StarCoder and Code Llama, play increasingly critical roles in various software development scenarios. For many applications, such as code snippet retrieval using natural language or code explanation, it is also crucial for code LLMs to possess both code generation and natural language abilities. The intricate interaction between acquiring language and coding skills complicates the development of strong code LLMs. Furthermore, there is a lack of thorough prior studies on LLM pretraining strategies that mix code and natural language. In this work, we propose a pretraining strategy to enhance the integration of natural language and coding capabilities within a single LLM. Specifically, it includes two phases of training with appropriately adjusted code/language ratios. The resulting model, Crystal, demonstrates remarkable capabilities in both domains: its natural language and coding performance are comparable to those of Llama 2 and Code Llama, respectively. Crystal exhibits better data efficiency, using 1.4 trillion tokens compared to the more than 2 trillion tokens used by Llama 2 and Code Llama. We verify our pretraining strategy by analyzing the training process and observe consistent improvements in most benchmarks. We also adopted a typical application adaptation phase with a code-centric data mixture, only to find that it did not lead to enhanced performance or training efficiency, underlining the importance of a carefully designed data recipe. To foster research within the community, we commit to open-sourcing every detail of the pretraining, including our training datasets, code, logs, and 136 checkpoints throughout the training.

Paper Structure

This paper contains 29 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: The multi-phase training process for Crystal.
  • Figure 2: Crystal shows a good balance of language and coding abilities. The $y$-axis is the average over ARC-C, HellaSwag, MMLU, and GSM8K. The $x$-axis is the average of MBPP and HumanEval.
  • Figure 3: Pretraining loss curve. The gray dashed line divides Phase 1 and 2. We do not observe many major loss spikes; if observed, we recovered by skipping specific data batches.
  • Figure 4: Evaluation results comparison across different models for zero-shot WebMC
  • Figure 5: Benchmark scores for all intermediate checkpoints across phases. Contrary to prior work, our Adaptation Phase does not improve the model. Instruction finetuning generally boosts the model performance as expected.
  • ...and 4 more figures