On The Role of Prompt Construction In Enhancing Efficacy and Efficiency of LLM-Based Tabular Data Generation

Banooqa Banday; Kowshik Thopalli; Tanzima Z. Islam; Jayaraman J. Thiagarajan

TL;DR

The study addresses the problem of semantically weak feature names in LLM-based tabular data generation by introducing context-enriched prompt construction protocols. It presents three strategies (Expert-guided, LLM-guided, and Novel-Mapping) and shows that enriched prompts improve both data quality (MLE-related performance) and training efficiency, with gains persisting under LoRA parameter-efficient fine-tuning. Across four real-world datasets and two LLMs, semantic prompting improves downstream results (e.g., up to several percentage points in accuracy and significant MSE reductions), and Novel-Mapping proves particularly effective when feature names offer little to no semantic context. These findings matter in practice for scalable synthetic data generation in domains with varying levels of feature-name interpretability, where they can support more reliable data augmentation and privacy-preserving analytics.

Abstract

LLM-based data generation for real-world tabular data can be challenged by the lack of sufficient semantic context in feature names used to describe columns. We hypothesize that enriching prompts with domain-specific insights can improve both the quality and efficiency of data generation. To test this hypothesis, we explore three prompt construction protocols: Expert-guided, LLM-guided, and Novel-Mapping. Through empirical studies with the recently proposed GReaT framework, we find that context-enriched prompts lead to significantly improved data generation quality and training efficiency.
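
To make the idea of prompt construction concrete, here is a minimal sketch of how a table row might be serialized into text with and without enriched feature names. It assumes a GReaT-style "feature is value" encoding, which the abstract does not spell out; the function `serialize_row`, the placeholder feature names `f1` and `f2`, and the descriptor strings are all hypothetical and are not taken from the paper or its datasets.

```python
def serialize_row(row, name_map=None):
    """Render one table row as comma-separated 'name is value' clauses.

    name_map optionally replaces a raw column name with a richer
    descriptor; columns missing from the map keep their raw name.
    """
    name_map = name_map or {}
    clauses = [f"{name_map.get(col, col)} is {val}" for col, val in row.items()]
    return ", ".join(clauses)


# Hypothetical row whose column names carry little semantic context.
row = {"f1": 0.5, "f2": 12}

# Hypothetical enriched descriptors, e.g. supplied by a domain expert
# or proposed by an LLM (illustrative only, not from the paper).
enriched = {
    "f1": "normalized signal width",
    "f2": "number of recorded events",
}

baseline_prompt = serialize_row(row)
enriched_prompt = serialize_row(row, enriched)

print(baseline_prompt)
print(enriched_prompt)
```

Under these assumptions, the only difference between the baseline and the enriched prompt is the text used for each feature name, so the values and the fine-tuning procedure stay the same while the prompt supplies more context to the language model.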

Paper Structure (15 sections, 4 figures, 2 tables)

Introduction
Background
Proposed Work
Prompt Construction Protocols
LLM Fine-tuning for Data Generation
Implementation
Experimental Setup
Results and Findings
Conclusions
Limitations
Detailed Descriptions of the Datasets
HELOC (Home Equity Line Of Credit)
Magic Gamma Telescope
California Housing
Parkinsons Diagnosis

Figures (4)

Figure 1: An overview of our approach for LLM-based tabular data generation. Our contributions include designing new prompt construction strategies and investigating their role in improving the quality of synthesized samples.
Figure 2: Enhanced prompt construction strategies lead to better computational efficiency.
Figure 3: Performance of ML models trained on synthetic data, generated by fine-tuning GPT-2 with LoRA using various prompting methods, evaluated on the Magic Telescope and Parkinson's diagnosis datasets.
Figure 4: Mapping generic feature names to semantically meaningful descriptors from a novel domain provides non-trivial gains in performance.
