Exploring different approaches to customize language models for domain-specific text-to-code generation

Luís Freire; Fernanda A. Andaló; Nicki Skafte Detlefsen

Abstract

Large language models (LLMs) have demonstrated strong capabilities in generating executable code from natural language descriptions. However, general-purpose models often struggle in specialized programming contexts where domain-specific libraries, APIs, or conventions must be used. Customizing smaller open-source models offers a cost-effective alternative to relying on large proprietary systems. In this work, we investigate how smaller language models can be adapted for domain-specific code generation using synthetic datasets. We construct datasets of programming exercises across three domains within the Python ecosystem: general Python programming, Scikit-learn machine learning workflows, and OpenCV-based computer vision tasks. Using these datasets, we evaluate three customization strategies: few-shot prompting, retrieval-augmented generation (RAG), and parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA). Performance is evaluated using both benchmark-based metrics and similarity-based metrics that measure alignment with domain-specific code. Our results show that prompting-based approaches such as few-shot learning and RAG can improve domain relevance in a cost-effective manner, although their impact on benchmark accuracy is limited. In contrast, LoRA-based fine-tuning consistently achieves higher accuracy and stronger domain alignment across most tasks. These findings highlight practical trade-offs between flexibility, computational cost, and performance when adapting smaller language models for specialized programming tasks.
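
To make the difference between the two prompting-based strategies concrete, the sketch below builds a few-shot prompt from a fixed set of exercises and a RAG-style prompt from exercises retrieved for the query. It is only an illustration under stated assumptions: the toy exercises, the prompt layout, the helper names (`retrieve`, `build_prompt`), and the word-overlap (Jaccard) retriever are all hypothetical stand-ins, not the paper's datasets, template, or retrieval method.

```python
import re

def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for an embedding model."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def jaccard(a, b):
    """Word-overlap similarity between two strings, in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def retrieve(query, exercises, k=2):
    """Return the k exercises whose task text is most similar to the query."""
    ranked = sorted(exercises, key=lambda ex: jaccard(query, ex["task"]), reverse=True)
    return ranked[:k]

def build_prompt(query, examples):
    """Concatenate solved examples, then the unsolved query (hypothetical layout)."""
    parts = [
        f"### Task\n{ex['task']}\n### Solution\n{ex['solution']}\n"
        for ex in examples
    ]
    parts.append(f"### Task\n{query}\n### Solution\n")
    return "\n".join(parts)

# Hypothetical toy exercises, one per domain named in the abstract.
EXERCISES = [
    {
        "task": "Train a logistic regression classifier with scikit-learn",
        "solution": "from sklearn.linear_model import LogisticRegression\n"
                    "clf = LogisticRegression().fit(X, y)",
    },
    {
        "task": "Convert an image to grayscale with OpenCV",
        "solution": "import cv2\ngray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)",
    },
    {
        "task": "Reverse a string in Python",
        "solution": "def reverse(s):\n    return s[::-1]",
    },
]

query = "Blur an image with OpenCV"

# Few-shot: the same fixed examples are used for every query.
few_shot_prompt = build_prompt(query, EXERCISES[:2])

# RAG-style: examples are selected per query by similarity.
rag_prompt = build_prompt(query, retrieve(query, EXERCISES, k=1))
```

In this sketch the only difference between the two strategies is how the in-context examples are chosen; LoRA-based fine-tuning, by contrast, changes model weights and is not shown here.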

Paper Structure (19 sections, 6 figures, 2 tables)

Introduction
Background
Overview of the customization pipeline
Synthetic dataset construction
Prompt engineering
Dataset Analysis
Dataset Validation
Dataset split
Language model customization
Base models
Customization techniques
Inference and evaluation setup
Results
Baseline performance
Few-shot learning
...and 4 more sections

Figures (6)

Figure 1: Overview of the customization pipeline. Synthetic programming exercises are generated by a teacher LLM and filtered through a validation stage. The resulting datasets are used to adapt smaller models via few-shot prompting, RAG, and LoRA-based fine-tuning. The adapted models are evaluated using benchmark and similarity metrics.
Figure 2: Prompt template used for dataset generation.
Figure 3: Example of a generated programming exercise.
Figure 4: Sample length distribution across the three domains.
Figure 5: Training dynamics during LoRA fine-tuning for DeepSeekCoder on Python tasks ($\alpha = r = 128$). The plot shows validation similarity and HumanEval Pass@1 during training. Increases in validation similarity correlate with improvements in benchmark accuracy.
...and 1 more figure
