Skill over Scale: The Case for Medium, Domain-Specific Models for SE

Manisha Mukherjee, Vincent J. Hellendoorn

TL;DR

The work challenges the assumption that code-related tasks require large generalist LLMs by showing that medium-sized, domain-specific models can outperform larger models when trained on in-domain data with contemporary LLM practices. SOBertBase and SOBertLarge, trained on 19 GB of StackOverflow data with a 2,048-token context using the Megatron-LM toolkit, outperform all baselines on four StackOverflow labeling tasks, including a novel obsoletion-detection benchmark, at comparatively modest cost. The study highlights the importance of long-range context, document-level data, and carefully designed pretraining for domain-specific NLP in SE, and releases the models publicly to encourage open, affordable alternatives to closed-source LLMs. Overall, the results suggest that targeted, well-executed domain pretraining can yield practical, high-performance solutions for domain-specific NLP challenges in software engineering.

Abstract

Recent advancements in AI have sparked a trend in constructing large, generalist language models that handle a multitude of tasks, including many code-related ones. While these models are expensive to train and are often closed-source, they have enjoyed broad adoption because they tend to outperform smaller, domain-specific models of code. In this work, we argue that this is not a foregone conclusion. We show that modestly sized domain-specific models can outperform much larger ones on code labeling tasks, provided they are trained to the same standards. Concretely, we focus on StackOverflow (SO), which offers large volumes of aligned code and text data. We align established best practices for pre-training large language models with properties of SO as a data source, especially using a large context window (2,048 tokens), coupled with a powerful toolkit (Megatron-LM) to train two models: SOBertBase (125M parameters) and SOBertLarge (762M parameters), at a budget of just $374 and $1,600, respectively. We compare the performance of our models with a prior domain-specific model which did not adopt many of these practices (BERTOverflow), as well as two general-purpose BERT models and two models in OpenAI's GPT series (GPT-3.5 and GPT-4). We study four labeling tasks: question quality prediction, closed question prediction, NER, and obsoletion prediction. The final task is a new benchmark we introduce, on which we additionally compare SOBert with a fine-tuned CodeLlama and StackLlama (models with 10x more parameters than SOBertLarge). Our models consistently outperform all baselines. In contrast, BERTOverflow is outperformed by generalist models in most tasks. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models. Both models are released to the public on Hugging Face.

Paper Structure (27 sections, 1 equation, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Model and dataset sizes of state-of-the-art LLMs.
  • Figure 2: Example StackOverflow page (ID: 14569223) demonstrating the structure of a question and answer on SO. Questions have a title and body and may have comments; answers have a body and, optionally, comments. We demonstrate which features are used for each of the four downstream tasks.
  • Figure 3: Framework outlining the key steps of our approach
  • Figure 4: Histogram showing length buckets of post+comments samples
  • Figure 5: Training and validation loss comparison of SOBertBase and SOBertLarge.
  • ...and 1 more figure