Scaling Laws Behind Code Understanding Model

Jiayi Lin; Hande Dong; Yutao Xie; Lei Zhang

TL;DR

This paper investigates whether the neural scaling law, well established in language models, extends to code understanding by systematically varying data, model size, and compute. Through extensive pre-training of a Transformer encoder on The Stack and evaluation on downstream tasks, the authors confirm a power-law decline in test error with scale, expressed as $e = k x^{-\alpha}$ and equivalently $\log e = -\alpha \log x + \log k$, where $e$ is the test error and $x$ is the scale, across data, parameter count, and compute. They show that larger pre-training scales improve performance on code search and clone detection, and they introduce CoLSBERT, a 1.5B-parameter code-understanding model trained on 351B tokens from six languages, achieving state-of-the-art results and strong probing performance. The work highlights practical implications for building bigger, data-rich code models and provides a foundation for future exploration of scale interactions and potential scaling-law breakpoints in code understanding.
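The log form of the power law above is a straight line in log-log space, so the exponent can be estimated with an ordinary linear fit. The following is a minimal sketch of that idea, not the authors' fitting procedure: the function name `fit_power_law`, the synthetic scales, and the constants `true_k` and `true_alpha` are all assumptions chosen for illustration and are not values from the paper.

```python
import numpy as np


def fit_power_law(x, e):
    """Estimate k and alpha in e = k * x**(-alpha).

    Uses least squares on log e = -alpha * log x + log k.
    Hypothetical helper for illustration only.
    """
    log_x = np.log(np.asarray(x, dtype=float))
    log_e = np.log(np.asarray(e, dtype=float))
    # polyfit with deg=1 returns [slope, intercept] of the fitted line
    slope, intercept = np.polyfit(log_x, log_e, deg=1)
    k = float(np.exp(intercept))  # intercept is log k
    alpha = float(-slope)         # slope is -alpha
    return k, alpha


# Synthetic, noise-free data (assumed values, not results from the paper)
true_k, true_alpha = 5.0, 0.1
scales = np.array([1e6, 1e7, 1e8, 1e9])
errors = true_k * scales ** (-true_alpha)

k, alpha = fit_power_law(scales, errors)
print(f"k = {k:.4f}, alpha = {alpha:.4f}")
```

On real measurements the points would scatter around the line, so the fitted $k$ and $\alpha$ would only approximate the underlying trend.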

Abstract

The scaling law is becoming a fundamental law in many machine learning areas: test error falls off as a power law as training data, model size, and computing resources increase. However, whether this law holds for code understanding has not been well studied, and most current language models for code understanding have about 100M parameters, which is relatively "small" compared to large language models. In this paper, we conduct extensive experiments to investigate the scaling law for the code understanding task by varying training data, model size, and computing resources. We validate that the test error of code understanding models falls off as a power law when using larger models, indicating that the scaling law holds for the code understanding task. In addition, we apply models of different scales to two downstream code understanding tasks and find that performance increases with model scale. Finally, we train a large-scale code understanding model named CoLSBERT with 1.5B parameters on a large dataset using more computing resources, which outperforms previous work by a large margin. We will release our code and the CoLSBERT model when our paper is published.

Paper Structure (29 sections, 2 equations, 7 figures, 7 tables)

Introduction
Preliminary
Transformer Architecture
Pre-training Tasks
Scaling Law in Language Model
Scaling Law in Code Understanding Model
Method and Implementation
Scaling Training Data
Scaling Model Size
Scaling Computing Resource
Downstream Tasks Evaluation
Method
Code Search
Clone Detection
Model and Result
...and 14 more sections

Figures (7)

Figure 1: An example of a power law on a log-log plot.
Figure 2: The test error distribution with regard to different amounts of the test set.
Figure 3: The test error with regard to different scales in the code understanding task.
Figure 4: Performance of different scaling models on the code search task. (a) Different pre-training data; (b) Different model sizes; (c) Different pre-training computing resources. The x-axis is $\log(\text{scale})$, and the y-axis is MRR on CodeSearchNet.
Figure 5: Performance of different scaling models on the clone detection task. (a) Different pre-training data; (b) Different model sizes; (c) Different pre-training computing resources. The x-axis is $\log(\text{scale})$, and the y-axis is MAP@R on the POJ-104 dataset.
...and 2 more figures
