Novel Preprocessing Technique for Data Embedding in Engineering Code Generation Using Large Language Model

Yu-Chen Lin; Akhilesh Kumar; Norman Chang; Wenliang Zhang; Muhammad Zakir; Rucha Apte; Haiyang He; Chao Wang; Jyh-Shing Roger Jang

TL;DR

The paper tackles domain-specific code generation for RedHawk-SC, where documentation is sparse, by introducing a data-centric preprocessing pipeline that improves embedding quality and retrieval in Retrieval-Augmented Generation (RAG). It combines a Data Splitter, Data Renovation, Data Augmentation, Chain of Density for Renovation Credibility (CoDRC), the Adaptive Text Renovation (ATR) algorithm, and Implicit Knowledge Expansion and Contemplation (IKEC) within a ChatEDA-inspired workflow to generate higher-quality, domain-specific code without fine-tuning. On RedHawk-SC scripts, the authors report a 73.33% Percentage of Correct Lines for MapReduce applications when using RAG+IKEC. The approach is intended to let engineering-domain code generation pipelines be deployed quickly, and its data-driven prompts and renovation strategies may apply to specialized domains beyond RedHawk-SC.

Abstract

We present four main contributions to enhance the performance of Large Language Models (LLMs) in generating domain-specific code: (i) utilizing LLM-based data splitting and data renovation techniques to improve the semantic representation of the embedding space; (ii) introducing the LLM-driven Chain of Density for Renovation Credibility (CoDRC) and the Adaptive Text Renovation (ATR) algorithm for assessing data renovation reliability; (iii) developing the Implicit Knowledge Expansion and Contemplation (IKEC) prompt technique; and (iv) effectively refactoring existing scripts to generate new, high-quality scripts with LLMs. Using the engineering simulation software RedHawk-SC as a case study, we demonstrate the effectiveness of our data preprocessing method for expanding and categorizing scripts. When combined with IKEC, these techniques help the Retrieval-Augmented Generation (RAG) method retrieve more relevant information, ultimately achieving a 73.33% "Percentage of Correct Lines" for code generation problems in MapReduce applications.
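
To make the retrieval idea concrete, here is a minimal sketch of why pairing a script chunk with a natural-language description can help a retriever match a plain-English query. It is not the paper's implementation: the bag-of-words `embed` function stands in for a real embedding model, the descriptions are written by hand rather than by an LLM, and `make_view` and `run_job` are invented placeholder names, not RedHawk-SC API calls.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding (assumption): lowercase token counts, not a real model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks whose text is most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c["text"])), reverse=True)
    return ranked[:k]


# Hypothetical raw script chunks (placeholder function names).
raw_chunks = [
    {"id": "a", "text": "view = make_view(design)"},
    {"id": "b", "text": "result = run_job(design, map_fn, reduce_fn)"},
]

# The same chunks with a hand-written description prepended, standing in
# for the kind of text an LLM-based renovation step might add.
renovated_chunks = [
    {"id": "a", "text": "Create a timing view from the design.\n" + raw_chunks[0]["text"]},
    {"id": "b", "text": "Run a map and reduce job over the design.\n" + raw_chunks[1]["text"]},
]

query = "how do I run a map reduce job"

# Raw code shares no tokens with the query, so every similarity is zero
# and the ranking carries no information.
raw_scores = [cosine(embed(query), embed(c["text"])) for c in raw_chunks]

# With descriptions attached, the MapReduce chunk ranks first.
best = retrieve(query, renovated_chunks, k=1)[0]
print(raw_scores, best["id"])
```

In this toy setting the raw chunks are indistinguishable to the retriever, while the described chunks are not. A real pipeline would use a learned embedding model, where the gap is usually less stark.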

Paper Structure (30 sections, 4 equations, 9 figures)

Introduction
Background
Models
Enhancing Input Tokens
Retrieval-Augmented Generation (RAG)
RAG vs Fine-tuning
Common Issues with the RAG Method -- Data Splitting
Prompt Techniques and Mechanisms
Distilling
Related Work in Code Generation
Methodology
Data Augmentation
Implicit Knowledge Expansion and Contemplation (IKEC)
Data Splitter
Data Renovation
...and 15 more sections

Figures (9)

Figure 1: Overall flowchart of the code generation process proposed in this study. The "Preprocessing" block includes three innovative data preprocessing techniques introduced in this paper. The Task Planner and Scripts Generator (green rounded rectangles) signify the processing performed by the LLM, combined with the RAG method.
Figure 2: Detailed schematic representations of the internal workings of the three methods discussed in this paper (Data Splitter, Data Renovation, Data Augmentation).
Figure 3: Code Generator Prompt - The Prompt used in the code generation process with the same System Prompt. The bold text should be replaced with the corresponding content, such as replacing "query" with the user's required sentence.
Figure 4: Data Augmentation Example - The source code of the two Scripts used.
Figure 5: Example (one of the experimental results) - Data Augmentation: a new script generated from the two source codes above using the LLM method. (The light blue and green backgrounds mark parts of the code whose structure is similar to the correspondingly colored source code in Fig. 4, pink indicates code generated from data obtained through the RAG method, and yellow denotes code generated using basic Python logic.)
...and 4 more figures
