A Comparative Study of DSL Code Generation: Fine-Tuning vs. Optimized Retrieval Augmentation

Nastaran Bassamzadeh, Chhaya Methani

TL;DR

This work tackles NL-to-DSL generation in enterprise automation, where DSLs use many custom API names and evolve rapidly. It compares a fine-tuned Codex baseline against optimized Retrieval-Augmented Generation (RAG) strategies that dynamically assemble few-shot prompts and incorporate API metadata. The fine-tuned model achieves the best code similarity, and optimized RAG reaches parity on that metric. RAG's compilation rate is 2 points higher, while its hallucination rate lags by 1 point for API names and by 2 points for API parameter keys. The findings suggest that carefully grounded RAG can be a scalable alternative to frequent fine-tuning in DSL generation tasks, with advantages for new, unseen APIs.

Abstract

Natural Language to Code Generation has made significant progress in recent years with the advent of Large Language Models (LLMs). While generation for general-purpose languages like C, C++, and Python has improved significantly, LLMs struggle with custom function names in Domain Specific Languages (DSLs). This leads to higher hallucination rates and more syntax errors, especially for DSLs with a large number of custom function names. Additionally, constant updates to function names add to the challenge, as LLMs need to stay up to date. In this paper, we present optimizations for using Retrieval Augmented Generation (RAG) with LLMs for DSL generation, along with an ablation study comparing these strategies. We generated training and test datasets with a DSL that represents automation tasks across roughly 700 APIs in the public domain. We used the training dataset to fine-tune a Codex model for this DSL. Our results showed that the fine-tuned model scored best on the code similarity metric. With our RAG optimizations, we achieved parity on the similarity metric. The compilation rate, however, showed that both models still got the syntax wrong many times, with the RAG-based method being 2 pts better. Conversely, the hallucination rate for the RAG model lagged by 1 pt for API names and by 2 pts for API parameter keys. We conclude that an optimized RAG model can match the quality of fine-tuned models and offer advantages for new, unseen APIs.
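To make the hallucination metric for API names concrete, here is a minimal sketch of one way such a rate could be computed. It is not the paper's definition: the `Connector.Operation(...)` call syntax, the three catalog entries, and the per-sample counting rule are all assumptions made for illustration.

```python
import re

# Hypothetical catalog of valid API names. The roughly 700 APIs used in the
# paper are not reproduced here; these three entries are illustrative only.
KNOWN_APIS = {"Outlook.SendEmail", "SharePoint.GetItems", "Teams.PostMessage"}

# Assumed surface syntax: an API call looks like `Connector.Operation(...)`.
API_CALL = re.compile(r"\b([A-Za-z_]\w*\.[A-Za-z_]\w*)\s*\(")


def extract_api_names(dsl_code: str) -> list[str]:
    """Return every `Connector.Operation` name that is followed by `(`."""
    return API_CALL.findall(dsl_code)


def api_hallucination_rate(samples: list[str], known: set[str]) -> float:
    """Fraction of generated samples that use at least one unknown API name."""
    if not samples:
        return 0.0
    hallucinated = sum(
        1
        for code in samples
        if any(name not in known for name in extract_api_names(code))
    )
    return hallucinated / len(samples)


generated = ["Outlook.SendEmail(to=x)", "Slack.Ping()"]
rate = api_hallucination_rate(generated, KNOWN_APIS)
```

Under these assumptions, the second sample counts as hallucinated because `Slack.Ping` is not in the catalog. A rate for parameter keys could be computed the same way against each API's allowed keys.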

Paper Structure (27 sections, 5 equations, 1 figure, 4 tables)

Figures (1)

  • Figure 1: System architecture showing the end-to-end working of our DSL generation methodology using RAG. TST-based semantic mapping retrieves the relevant code snippet as shown. This helps get the right syntax. However, it gets the correct function name for approval from the API metadata.
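
The flow in the Figure 1 caption, which retrieves similar code snippets for syntax and then grounds function names in API metadata, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation. Token-overlap (Jaccard) similarity stands in for the TST-based semantic mapping, and the example pairs, API names, metadata strings, and prompt layout are all hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for a learned retriever."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


# Hypothetical (natural language, DSL) pairs used as the few-shot pool.
EXAMPLES = [
    ("send an email when a new file is added",
     "OneDrive.OnFileCreated() -> Outlook.SendEmail()"),
    ("post a message to a channel every morning",
     "Schedule.Daily() -> Teams.PostMessage()"),
    ("create a task when an email arrives",
     "Outlook.OnNewEmail() -> Planner.CreateTask()"),
]

# Hypothetical API metadata: name -> short description.
API_METADATA = {
    "OneDrive.OnFileCreated": "Trigger: fires when a file is created.",
    "Outlook.SendEmail": "Action: sends an email.",
    "Outlook.OnNewEmail": "Trigger: fires when an email arrives.",
    "Planner.CreateTask": "Action: creates a task.",
    "Teams.PostMessage": "Action: posts a message to a channel.",
}


def build_prompt(query: str, examples, metadata, k: int = 2) -> str:
    """Assemble a few-shot prompt from the top-k examples plus their API metadata."""
    top = sorted(examples, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]
    # Include metadata only for APIs that appear in the retrieved snippets.
    used = [name for name in metadata if any(name in dsl for _, dsl in top)]
    lines = ["# API metadata"]
    lines += [f"{name}: {metadata[name]}" for name in used]
    lines.append("# Examples")
    for nl, dsl in top:
        lines += [f"Request: {nl}", f"DSL: {dsl}"]
    lines += [f"Request: {query}", "DSL:"]
    return "\n".join(lines)


prompt = build_prompt("send an email when a file is uploaded", EXAMPLES, API_METADATA)
```

Because the few-shot pool and the metadata are looked up at prompt time, new or renamed APIs can be supported by updating the two collections rather than retraining a model.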