An Automatic Prompt Generation System for Tabular Data Tasks

Ashlesha Akella, Abhijit Manatkar, Brij Chavda, Hima Patel

TL;DR

This paper presents an auto-prompt generation system that works with multiple LLMs and requires minimal training. It proposes two novel methods: a reinforcement learning-based algorithm for identifying and sequencing task-relevant columns, and a cell-level similarity-based approach for improving few-shot example selection.

Abstract

Efficient processing of tabular data is important in various industries, especially when working with datasets containing a large number of columns. Large language models (LLMs) have demonstrated their ability on several tasks through carefully crafted prompts. However, creating effective prompts for tabular datasets is challenging due to the structured nature of the data and the need to manage numerous columns. This paper presents an auto-prompt generation system that works with multiple LLMs and requires minimal training. It proposes two novel methods: (1) a reinforcement learning-based algorithm for identifying and sequencing task-relevant columns, and (2) a cell-level similarity-based approach for improving few-shot example selection. The approach has been tested across 66 datasets and shows improved performance on three downstream tasks (data imputation, error detection, and entity matching) using two distinct LLMs: Google flan-t5-xxl and Mixtral 8x7B.
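
To make the second method concrete, here is a minimal sketch of few-shot example selection driven by cell-level similarity. The abstract does not specify the similarity measure, so this sketch assumes token-level Jaccard overlap per cell, averaged over the chosen columns. The names `cell_similarity`, `row_similarity`, and `select_few_shot` are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

Row = Dict[str, str]


def cell_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased whitespace tokens (an assumed measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty cells are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def row_similarity(query: Row, candidate: Row, columns: List[str]) -> float:
    """Average the cell-level similarities over the given columns."""
    if not columns:
        return 0.0
    scores = [
        cell_similarity(query.get(col, ""), candidate.get(col, ""))
        for col in columns
    ]
    return sum(scores) / len(scores)


def select_few_shot(query: Row, pool: List[Row], columns: List[str], k: int = 2) -> List[Row]:
    """Return the k pool rows most similar to the query, best first."""
    ranked = sorted(
        pool,
        key=lambda row: row_similarity(query, row, columns),
        reverse=True,
    )
    return ranked[:k]
```

In a data imputation setting, `query` would be the row with the missing value, `pool` the labeled rows available as demonstrations, and `columns` the task-relevant columns chosen for the prompt. Comparing cell by cell, rather than comparing whole serialized rows, keeps one long free-text column from dominating the ranking.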

Paper Structure (19 sections, 9 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: Example Prompt Template for Data Imputation task
  • Figure 2: Variations in accuracy across different combinations and permutations of manually selected columns for Data Imputation (DI) and Error Detection (ED). We collected accuracies for all possible permutations of the selected columns (per dataset and per task) and visualized the distributions of accuracies.
  • Figure 3: The architecture comprises three modules: RL agent Training Module for Column Selection, Build Prompt Module and Evaluation.
  • Figure 4: The plot shows the reward accumulated by the RL agent in each training episode. The solid lines represent the average, and the shaded areas depict the highest and lowest test accuracy across 3 different seeds.
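
Figures 2 and 4 both rest on the idea that accuracy depends on which columns appear in the prompt and in what order, and that an agent can learn a good ordering from per-episode reward. The sketch below illustrates that idea with a simple epsilon-greedy bandit over column permutations. This is a stand-in technique chosen for brevity, not the paper's RL agent, and `search_column_order` and `reward_fn` are hypothetical names. In practice `reward_fn` would build a prompt from the ordered columns and return the downstream accuracy of the LLM.

```python
import itertools
import random
from typing import Callable, Sequence, Tuple

Arm = Tuple[str, ...]


def search_column_order(
    columns: Sequence[str],
    reward_fn: Callable[[Arm], float],
    subset_size: int,
    episodes: int = 200,
    epsilon: float = 0.2,
    seed: int = 0,
) -> Arm:
    """Epsilon-greedy search over ordered column subsets (assumed setup)."""
    rng = random.Random(seed)
    # Each arm is one ordered selection of `subset_size` columns.
    arms = list(itertools.permutations(columns, subset_size))
    totals = {arm: 0.0 for arm in arms}
    counts = {arm: 0 for arm in arms}

    def mean_reward(arm: Arm) -> float:
        return totals[arm] / counts[arm]

    for _ in range(episodes):
        untried = [arm for arm in arms if counts[arm] == 0]
        if untried:
            arm = untried[0]            # try every arm once first
        elif rng.random() < epsilon:
            arm = rng.choice(arms)      # explore a random ordering
        else:
            arm = max(arms, key=mean_reward)  # exploit the best so far
        totals[arm] += reward_fn(arm)
        counts[arm] += 1

    tried = [arm for arm in arms if counts[arm] > 0]
    return max(tried, key=mean_reward)
```

Enumerating all permutations is only feasible for a handful of columns, since the number of arms grows factorially. That growth is one reason a learned policy is attractive for wide tables.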