Narrow Transformer: StarCoder-Based Java-LM For Desktop

Kamalkumar Rathinasamy, Balaji A J, Ankush Kumar, Gagan Gayari, Harshini K, Rajab Ali Mondal, Sreenivasa Raghavan K S, Swayam Singh, Mohammed Rafee Tarafdar

TL;DR

The paper presents NT-Java-1.1B, an open-source, Java-specific code LLM built on StarCoderBase-1.1B and intended for desktop deployment. It surpasses its base model and the majority of similarly sized models on the MultiPL-E Java benchmark, and its quantized versions perform comparably to open models of around 1.1B parameters. The work targets a gap in the literature: earlier specialization studies extend large, generic pre-trained models, mostly toward Python, while small code models for other languages remain understudied, even though large models need GPUs for inference. Practically, NT-Java-1.1B lays the groundwork for a family of language- and size-specific NT models that can run on developer desktops, reducing reliance on GPUs while maintaining competitive performance.

Abstract

This paper presents NT-Java-1.1B, an open-source specialized code language model built on StarCoderBase-1.1B and designed for coding tasks in Java. NT-Java-1.1B achieves state-of-the-art performance, surpassing its base model and the majority of other models of similar size on the MultiPL-E Java code benchmark. While there have been studies on extending large, generic pre-trained models to improve proficiency in specific programming languages like Python, similar investigations on small code models for other programming languages are lacking. Large code models require specialized hardware like GPUs for inference, highlighting the need for research into building small code models that can be deployed on developer desktops. This paper addresses this research gap by focusing on the development of a small Java code model, NT-Java-1.1B, and its quantized versions, which perform comparably to open models of around 1.1B parameters on the MultiPL-E Java code benchmark, making them well suited to desktop deployment. This paper establishes the foundation for a family of NT models specialized across languages and sizes.

Paper Structure (173 sections, 2 figures, 28 tables)

Figures (2)

  • Figure 1: File-level license assignment logic.
  • Figure 2: The distribution of the top $20$ programming languages in our crawled documentation collection.