Novel Preprocessing Technique for Data Embedding in Engineering Code Generation Using Large Language Model
Yu-Chen Lin, Akhilesh Kumar, Norman Chang, Wenliang Zhang, Muhammad Zakir, Rucha Apte, Haiyang He, Chao Wang, Jyh-Shing Roger Jang
TL;DR
The paper addresses domain-specific code generation for RedHawk-SC, where documentation is sparse, by introducing a data-centric preprocessing pipeline that improves embedding quality and retrieval in Retrieval-Augmented Generation (RAG). The pipeline combines a Data Splitter, Data Renovation, Data Augmentation, Chain of Density for Renovation Credibility (CoDRC), the Adaptive Text Renovation (ATR) algorithm, and Implicit Knowledge Expansion and Contemplation (IKEC) prompting within a ChatEDA-inspired workflow, producing higher-quality domain-specific code without fine-tuning. On RedHawk-SC scripts, RAG combined with IKEC reaches a reported 73.33% Percentage of Correct Lines on code generation problems in MapReduce applications. The authors suggest the approach allows engineering-domain code generation pipelines to be deployed quickly, and that its data-driven prompts and renovation strategies may carry over from RedHawk-SC to other specialized domains.
Abstract
We present four main contributions to enhance the performance of Large Language Models (LLMs) in generating domain-specific code: (i) utilizing LLM-based data splitting and data renovation techniques to improve the semantic representation of the embedding space; (ii) introducing the Chain of Density for Renovation Credibility (CoDRC), driven by LLMs, and the Adaptive Text Renovation (ATR) algorithm for assessing data renovation reliability; (iii) developing the Implicit Knowledge Expansion and Contemplation (IKEC) prompt technique; and (iv) effectively refactoring existing scripts to generate new, high-quality scripts with LLMs. Using the engineering simulation software RedHawk-SC as a case study, we demonstrate the effectiveness of our data pre-processing method for expanding and categorizing scripts. When combined with IKEC, these techniques help the Retrieval-Augmented Generation (RAG) method retrieve more relevant information, ultimately achieving a 73.33% "Percentage of Correct Lines" for code generation problems in MapReduce applications.
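The abstract reports a "Percentage of Correct Lines" but does not define how it is computed. The sketch below shows one plausible reading, given here purely as an assumption: the share of non-empty reference lines that also appear, after whitespace trimming, in the generated script. The function name `percentage_of_correct_lines` and the exact-match rule are hypothetical and may differ from the authors' evaluation procedure.

```python
from collections import Counter


def percentage_of_correct_lines(generated: str, reference: str) -> float:
    """Hypothetical metric: percent of reference lines reproduced in the generated script.

    Assumption: a line is "correct" if it matches a reference line exactly
    after stripping leading and trailing whitespace. Blank lines are ignored.
    """
    def normalize(script: str) -> list[str]:
        return [line.strip() for line in script.splitlines() if line.strip()]

    ref_lines = normalize(reference)
    if not ref_lines:
        return 0.0

    # Multiset of generated lines, so one generated line can match
    # at most one reference line.
    remaining = Counter(normalize(generated))
    correct = 0
    for line in ref_lines:
        if remaining[line] > 0:
            remaining[line] -= 1
            correct += 1
    return 100.0 * correct / len(ref_lines)


if __name__ == "__main__":
    # Toy scripts for illustration only; not taken from the paper's data.
    reference = "a = 1\nb = 2\nc = 3"
    generated = "a = 1\n    b = 2\nc = 4"
    print(f"{percentage_of_correct_lines(generated, reference):.2f}%")
```

Under this reading the score is order-insensitive and penalizes only missing or altered lines. A stricter variant could compare lines position by position, and a looser one could compare parsed statements instead of raw text.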
