Kotlin ML Pack: Technical Report

Sergey Titov, Mikhail Evtikhiev, Anton Shapkin, Oleg Smirnov, Sergei Boytsov, Dariia Karaeva, Maksim Sheptyakov, Mikhail Arkhipov, Timofey Bryksin, Egor Bogomolov

TL;DR

This work addresses the scarcity of high-quality Kotlin data for language modeling by introducing the Kotlin ML Pack: three datasets (KStack, KStack-clean, and KExercises) and a Kotlin version of the HumanEval benchmark. It shows that fine-tuning on targeted, high-quality data and instruction-focused tasks improves Kotlin code generation and understanding, with up to a 16-point increase in pass rate on HumanEval. The results highlight the importance of data quality over sheer volume, and the report discusses practical steps such as classifier-based filtering and synthetic translation to widen Kotlin coverage. The contributions offer a replicable blueprint for advancing code generation in Kotlin and other low-resource languages, with future work pointing toward static analysis tools, synthetic data, and richer benchmarks.

Abstract

In this technical report, we present three novel datasets of Kotlin code: KStack, KStack-clean, and KExercises. We also describe the results of fine-tuning CodeLlama and DeepSeek models on this data. Additionally, we present a version of the HumanEval benchmark rewritten into Kotlin by human experts, covering both the solutions and the tests. Our results demonstrate that small, high-quality datasets (KStack-clean and KExercises) can significantly improve model performance on code generation tasks, achieving up to a 16-point increase in pass rate on the HumanEval benchmark. Lastly, we discuss potential future work on improving language modeling for Kotlin, including the use of static analysis tools in the learning process and the introduction of more intricate and realistic benchmarks.

Paper Structure (29 sections, 1 equation, 3 figures, 2 tables)


Figures (3)

  • Figure 1: Kotlin and Python HumanEval Scores for various models.
  • Figure 2: Prompt for Python to Kotlin translation.
  • Figure 3: Pass rate on HumanEval for Kotlin for different filtering strategies of the KStack-clean dataset when fine-tuning the CodeLlama-7B model.