Go-UT-Bench: A Fine-Tuning Dataset for LLM-Based Unit Test Generation in Go

Yashshi Pipalani, Hritik Raj, Rajat Ghosh, Vaishnavi Bhargava, Debojyoti Dutta

TL;DR

Go-UT-Bench addresses the data bottleneck for domain-specific code LLMs by providing a large, diverse Go code–unit-test dataset with reproducible metadata. The authors introduce an AST-guided chunking pipeline to handle long Go files and evaluate fine-tuning effects on two LLM families (MoE and dense decoders) using an oracle-based unit-test generation benchmark. Fine-tuning yields substantial improvements in unit-test generation, with win rates exceeding 75% across repositories, demonstrating the value of domain-specific data for Go. The work advances open research in AI-powered software engineering by enabling reproducible Go unit-test generation and highlighting directions for cross-language adaptation and concurrency-aware prompting.
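The summary above does not spell out how the AST-guided chunking works, so the following is only a minimal sketch of one plausible reading: split a Go file at top-level declaration boundaries so that no chunk cuts a function in half. It uses the standard library packages `go/parser`, `go/ast`, and `go/token`; the names `Chunk` and `ChunkByDecl` are hypothetical and are not taken from the paper.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// Chunk is one top-level declaration cut out of a Go source file.
// (Hypothetical type for illustration; not from the paper.)
type Chunk struct {
	Name string // function or method name, or "decl" for other declarations
	Text string // exact source text of the declaration, without its doc comment
}

// ChunkByDecl parses src and returns one chunk per top-level declaration,
// so that chunk boundaries always coincide with AST node boundaries.
func ChunkByDecl(src string) ([]Chunk, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "input.go", src, parser.ParseComments)
	if err != nil {
		return nil, err
	}
	var chunks []Chunk
	for _, decl := range file.Decls {
		// Positions map back to byte offsets in src.
		start := fset.Position(decl.Pos()).Offset
		end := fset.Position(decl.End()).Offset
		name := "decl"
		if fn, ok := decl.(*ast.FuncDecl); ok {
			name = fn.Name.Name
		}
		chunks = append(chunks, Chunk{Name: name, Text: src[start:end]})
	}
	return chunks, nil
}

func main() {
	src := "package demo\n\n" +
		"func Add(a, b int) int { return a + b }\n\n" +
		"func Sub(a, b int) int { return a - b }\n"
	chunks, err := ChunkByDecl(src)
	if err != nil {
		panic(err)
	}
	for _, c := range chunks {
		fmt.Printf("%s: %d bytes\n", c.Name, len(c.Text))
	}
}
```

A real pipeline would likely also group small declarations up to a token budget and keep each function paired with its test; those steps are left out here because the summary gives no details on them.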

Abstract

Training data imbalance poses a major challenge for code LLMs. Most available data heavily overrepresents raw open-source code while underrepresenting broader software engineering tasks, especially in low-resource languages like Golang. As a result, models excel at code autocompletion but struggle with real-world developer workflows such as unit test generation. To address this gap, we introduce Go-UT-Bench, a benchmark dataset of 5264 pairs of code and unit tests, drawn from 10 permissively licensed Golang repositories spanning diverse domains. We evaluate its effectiveness as a fine-tuning dataset across two LLM families, namely mixture-of-experts and dense decoders. Our results show that fine-tuned models outperform their base counterparts on more than 75% of benchmark tasks.

Paper Structure

This paper contains 17 sections, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: PCA-based analysis of Go-UT-Bench revealing the internal structure of the dataset.
  • Figure 2: The diversity in {code, unit test} pairs in terms of line lengths across 10 different open-source repositories in Go-UT-Bench.
  • Figure 3: Prompt for pairwise evaluation of two LLM-generated responses (Assistant A and Assistant B) w.r.t. the ground truth.
  • Figure 4: Fine-tuning results for deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct. The left plot compares the win rates between the fine-tuned and the base models across all validation pairs in Go-UT-Bench, while the right plot illustrates the distribution of repositories within the validation dataset.
  • Figure 5: Fine-tuning results for meta-llama/Llama-3.2-3B-Instruct. The left plot compares the win rates between the fine-tuned and the base models across all validation pairs in Go-UT-Bench, while the right plot displays the distribution of repositories within the validation dataset.
  • ...and 5 more figures