ADO: Automatic Data Optimization for Inputs in LLM Prompts

Sam Lin, Wenyue Hua, Lingyao Li, Zhenting Wang, Yongfeng Zhang

TL;DR

The paper studies how the content and formatting of the input data in a prompt influence LLM performance, and proposes Automatic Data Optimization (ADO) to improve both. ADO takes a two-pronged approach: content optimization (imputation, attribute filtering, enrichment) followed by format optimization (data presentation), so that the optimized input is $D' = f_{\text{format}}(f_{\text{content}}(D))$. It employs a three-LLM workflow (Prompt-Generation, Data-Optimization, and Task-Inference) with an Objective Evaluator that iteratively refines the data-optimization prompts. The search is aided by Diverse Prompt Search (DPS), which enforces semantic and lexical diversity via constraints $c_1$ and $c_2$ and Bayesian-tuned hyperparameters. Empirical results across nine real-world datasets and multiple backbones show that ADO consistently improves performance, and that combining ADO with existing prompt-engineering techniques (CoT, ICL, PE2) yields further gains, which points to the practical value of augmenting LLM inference with optimized input data.

Abstract

This study explores a novel approach to enhance the performance of Large Language Models (LLMs) through the optimization of input data within prompts. While previous research has primarily focused on refining instruction components and augmenting input data with in-context examples, our work investigates the potential benefits of optimizing the input data itself. We introduce a two-pronged strategy for input data optimization: content engineering and structural reformulation. Content engineering involves imputing missing values, removing irrelevant attributes, and enriching profiles by generating additional information inferred from existing attributes. Subsequent to content engineering, structural reformulation is applied to optimize the presentation of the modified content to LLMs, given their sensitivity to input format. Our findings suggest that these optimizations can significantly improve the performance of LLMs in various tasks, offering a promising avenue for future research in prompt engineering. The source code is available at https://anonymous.4open.science/r/ADO-6BC5/
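To make the composition $D' = f_{\text{format}}(f_{\text{content}}(D))$ concrete, the following is a minimal sketch under stated assumptions. In the paper both stages are carried out by an LLM following generated instructions; here plain rule-based Python functions stand in for them purely for illustration. The record fields (`age`, `occupation`, `row_id`), the default values, the `age_group` enrichment rule, and the bullet-style output format are all hypothetical and do not come from the paper.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Optional[Any]]


def f_content(record: Record, relevant: List[str], defaults: Dict[str, Any]) -> Record:
    """Hypothetical rule-based stand-in for content optimization."""
    # Attribute filtering: keep only the attributes assumed relevant to the task.
    kept: Record = {key: record.get(key) for key in relevant}
    # Imputation: fill missing values from assumed per-attribute defaults.
    for key, value in kept.items():
        if value is None:
            kept[key] = defaults.get(key, "unknown")
    # Enrichment: derive an extra attribute from an existing one (toy rule).
    age = kept.get("age")
    if isinstance(age, int):
        kept["age_group"] = "senior" if age >= 65 else "adult" if age >= 18 else "minor"
    return kept


def f_format(record: Record) -> str:
    """Hypothetical format optimization: render the record as a bullet list."""
    return "\n".join(f"- {key}: {value}" for key, value in record.items())


def optimize_input(record: Record, relevant: List[str], defaults: Dict[str, Any]) -> str:
    """Compose the two stages: D' = f_format(f_content(D))."""
    return f_format(f_content(record, relevant, defaults))
```

Calling `optimize_input({"age": 70, "occupation": None, "row_id": 17}, ["age", "occupation"], {"occupation": "unknown"})` drops `row_id`, imputes the missing occupation, adds the derived `age_group`, and returns the result as a three-line bullet list. The order of the two stages matters: formatting is applied last so that it presents the already filtered, imputed, and enriched content.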

Paper Structure

This paper contains 29 sections, 3 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Types of prompt engineering approaches. Given an inference task, such as solving a logical puzzle (as shown in the middle of the figure), prior works primarily focus on either optimizing instructions or augmenting the input data with similar examples, as depicted at the top of the figure. In contrast, we propose optimizing the input data to enhance its presentation to LLMs for more effective task inference, as illustrated at the bottom of the figure.
  • Figure 2: ADO Workflow. The Prompt-Generation LLM initially proposes task-specific instructions for optimizing input data, which the Data Optimization LLM executes on validation set samples, generating optimized inputs. These optimized samples are then processed by the Task Inference LLM to produce task predictions. The Objective Evaluator compares these predictions against the expected outputs (ground truth) using task-specific metrics to compute a score. This score represents the quality of the data optimization instructions, with prior prompt-score pairs provided as additional context to the Prompt-Generation LLM for refining instructions in future iterations.
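
The iterative loop described in the Figure 2 caption can be sketched roughly as follows. This is an illustrative outline, not the authors' implementation: the three LLMs are passed in as ordinary callables (`propose`, `optimize`, `infer`), all names are hypothetical, exact-match accuracy is assumed as the task-specific metric, and Diverse Prompt Search is omitted.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]  # (data-optimization instruction, score) pairs


def accuracy(predictions: List[str], expected: List[str]) -> float:
    """Objective Evaluator stand-in: fraction of exact matches (assumed metric)."""
    if not expected:
        return 0.0
    return sum(p == e for p, e in zip(predictions, expected)) / len(expected)


def ado_search(
    propose: Callable[[History], str],      # Prompt-Generation LLM (hypothetical callable)
    optimize: Callable[[str, str], str],    # Data-Optimization LLM: (instruction, input) -> optimized input
    infer: Callable[[str], str],            # Task-Inference LLM: optimized input -> prediction
    validation: List[Tuple[str, str]],      # (raw input, expected output) pairs
    iterations: int = 3,
) -> Tuple[str, float]:
    """Return the best-scoring instruction found and its validation score."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    history: History = []
    expected = [gold for _, gold in validation]
    for _ in range(iterations):
        # Prior prompt-score pairs are given to the generator as context.
        instruction = propose(history)
        optimized = [optimize(instruction, raw) for raw, _ in validation]
        predictions = [infer(item) for item in optimized]
        history.append((instruction, accuracy(predictions, expected)))
    # Pick the instruction with the highest score (first one wins on ties).
    return max(history, key=lambda pair: pair[1])
```

Passing the models in as callables keeps the sketch independent of any particular LLM API; in practice each callable would wrap a model call, and the history passed to `propose` would be serialized into the prompt of the Prompt-Generation LLM.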