Scaling Granite Code Models to 128K Context

Matt Stallone; Vaibhav Saxena; Leonid Karlinsky; Bridget McGinn; Tim Bula; Mayank Mishra; Adriana Meza Soria; Gaoyuan Zhang; Aditya Prasad; Yikang Shen; Saptha Surendran; Shanmukha Guttula; Hima Patel; Parameswaran Selvam; Xuan-Hong Dang; Yan Koyfman; Atin Sood; Rogerio Feris; Nirmit Desai; David D. Cox; Ruchir Puri; Rameswar Panda

TL;DR

The paper tackles the limitation of open-source code LLMs with small context windows by presenting Granite code models that support up to 128K tokens. It achieves this through a two-stage process: lightweight continual pretraining with repository-level packing and RoPE base-frequency adjustments, followed by instruction tuning on a mixture of short- and synthetic long-context data. Results show substantial improvements on long-context tasks such as Long Code Completion, RepoBench-P, and RepoQA, with minimal or no degradation on short-context benchmarks like HumanEval, and instruct models exhibit especially strong retrieval performance. The authors release all long-context Granite models under the Apache 2.0 license, enabling both research and commercial use and paving the way for further scaling of context lengths in code-focused LLMs.
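As a rough illustration of what repository-level packing might look like, the sketch below groups source files by repository and concatenates each group into one long training document. The function name `pack_by_repository`, the `<file_sep>` separator, the path-comment header, and the sort-by-path ordering are all assumptions made for illustration; the summary above does not specify how the authors order or delimit files.

```python
from collections import defaultdict


def pack_by_repository(files, separator="\n<file_sep>\n"):
    """Group (repo, path, text) tuples by repo and join each group into one document.

    The separator string and the per-file path header are hypothetical choices,
    not taken from the paper.
    """
    grouped = defaultdict(list)
    for repo, path, text in files:
        grouped[repo].append((path, text))

    packed = {}
    for repo, entries in grouped.items():
        entries.sort()  # assumed ordering: alphabetical by path, for determinism
        packed[repo] = separator.join(f"# {path}\n{text}" for path, text in entries)
    return packed


example_files = [
    ("repo_a", "utils.py", "def helper(): ..."),
    ("repo_b", "main.py", "print('hi')"),
    ("repo_a", "app.py", "from utils import helper"),
]
example_packed = pack_by_repository(example_files)
```

Packing this way keeps cross-file dependencies (such as `app.py` importing from `utils.py`) inside a single sequence, which is one plausible reason repository-level documents help with long-context code tasks.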

Abstract

This paper introduces long-context Granite code models that support effective context windows of up to 128K tokens. Our solution for scaling the context length of Granite 3B/8B code models from 2K/4K to 128K consists of lightweight continual pretraining that gradually increases the RoPE base frequency, using repository-level file packing and length-upsampled long-context data. We also release instruction-tuned models with long-context support, derived by further finetuning the long-context base models on a mix of permissively licensed short- and long-context instruction-response pairs. Compared to the original short-context Granite code models, our long-context models achieve significant improvements on long-context tasks without any noticeable performance degradation on regular code completion benchmarks (e.g., HumanEval). We release all our long-context Granite code models under an Apache 2.0 license for both research and commercial use.
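To make the role of the RoPE base frequency concrete, here is a minimal sketch using the standard RoPE definition, in which the i-th dimension pair rotates with inverse frequency base^(-2i/d). Raising the base lengthens the slowest rotation period, so positions far apart remain distinguishable. The head dimension of 128 and the list of base values are illustrative assumptions only and are not the schedule used by the authors.

```python
import math


def rope_inverse_frequencies(head_dim: int, base: float) -> list[float]:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Period, in token positions, of the slowest-rotating dimension pair."""
    return 2.0 * math.pi / rope_inverse_frequencies(head_dim, base)[-1]


# Hypothetical base values, chosen only to show the trend.
HYPOTHETICAL_BASES = [1e4, 1e5, 1e6, 1e7]
HEAD_DIM = 128  # assumed head dimension

for base in HYPOTHETICAL_BASES:
    print(f"base={base:>12,.0f}  longest wavelength ~ {longest_wavelength(HEAD_DIM, base):,.0f} tokens")
```

Increasing the base in steps, rather than jumping straight to the final value, is consistent with the gradual schedule the abstract describes, although the exact step sizes are not given here.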

Paper Structure (10 sections, 3 figures, 5 tables)

This paper contains 10 sections, 3 figures, 5 tables.

Introduction
Long Context Modeling
Continual Pretraining
Instruction Tuning
Results
Benchmarks
Base Model Evaluations
Instruct Model Evaluations
Short Context Evaluations
Conclusion

Figures (3)

Figure 1: Retrieval accuracy of Granite 3B/8B code instruct models before and after scaling to 128K context length on RepoQA benchmark (with a matching threshold of 0.5).
Figure 2: Key retrieval (a.k.a. Needle-in-a-Haystack) performance of Granite-8B-Code-Instruct with context scaling. The X-axis represents sequence length (tokens) and the Y-axis represents the key offset percentage in retrieval. Best viewed in color.
Figure 3: Effect of long-context extension on HumanEval benchmark. While we observe a slight degradation in performance for base models, instruct models see an improvement with long-context scaling, most likely due to our mixing of short-context SFT data with long-context multi-turn synthetic data. Best viewed in color.
