Kotlin ML Pack: Technical Report

Sergey Titov; Mikhail Evtikhiev; Anton Shapkin; Oleg Smirnov; Sergei Boytsov; Sergei Boytsov; Dariia Karaeva; Maksim Sheptyakov; Mikhail Arkhipov; Timofey Bryksin; Egor Bogomolov

TL;DR

This work tackles the scarcity of high-quality Kotlin data for language modeling by introducing the Kotlin ML Pack, which comprises three datasets (KStack, KStack-clean, and KExercises) and a Kotlin version of the HumanEval benchmark. It shows that small, high-quality datasets and instruction-focused tasks improve Kotlin code generation, with fine-tuning on KStack-clean and KExercises raising the HumanEval pass rate by up to 16 points. The results highlight the importance of data quality over sheer volume and cover practical steps such as classifier-based filtering and synthetic translation to widen Kotlin coverage. The contributions offer a replicable blueprint for advancing code generation in Kotlin and other low-resource languages, with future work pointing toward static analysis tools, synthetic data, and richer benchmarks.

Abstract

In this technical report, we present three novel datasets of Kotlin code: KStack, KStack-clean, and KExercises. We also describe the results of fine-tuning CodeLlama and DeepSeek models on this data. Additionally, we present a version of the HumanEval benchmark in which human experts rewrote both the solutions and the tests in Kotlin. Our results demonstrate that small, high-quality datasets (KStack-clean and KExercises) can significantly improve model performance on code generation tasks, achieving up to a 16-point increase in pass rate on the HumanEval benchmark. Lastly, we discuss potential future work on improving language modeling for Kotlin, including the use of static analysis tools in the learning process and the introduction of more intricate and realistic benchmarks.

Paper Structure (29 sections, 1 equation, 3 figures, 2 tables)

Introduction
Current state of the art of Kotlin code generation
Kotlin data
KStack: Kotlin language corpus
KStack-clean: Learning the code quality
Comparison with other datasets
KExercises: Kotlin instructions dataset
Kotlin evaluation
HumanEval for Kotlin
Evaluation setup for code generation
Evaluation setup for code completion
Learning Kotlin
Base models
Datasets
KStack
...and 14 more sections

Figures (3)

Figure 1: Kotlin and Python HumanEval Scores for various models.
Figure 2: Prompt for Python to Kotlin translation.
Figure 3: Pass rate on HumanEval for Kotlin for different filtration strategies of the KStack-clean dataset, finetuning the CodeLlama-7B model.
