ADO: Automatic Data Optimization for Inputs in LLM Prompts
Sam Lin, Wenyue Hua, Lingyao Li, Zhenting Wang, Yongfeng Zhang
TL;DR
The paper studies how the content and formatting of prompt input data influence LLM performance and proposes automatic data optimization (ADO) for prompt inputs to address this. It defines a two-pronged approach: content optimization (imputation, attribute filtering, enrichment) followed by format optimization (data presentation), so that the optimized input is $D' = f_{\text{format}}(f_{\text{content}}(D))$. ADO employs a three-LLM workflow (Prompt-Generation, Data-Optimization, and Task-Inference) together with an Objective Evaluator that iteratively refines the data-optimization prompts. The search is aided by Diverse Prompt Search (DPS), which enforces semantic and lexical diversity via constraints $c_1$ and $c_2$, with hyperparameters tuned by Bayesian optimization. Empirical results across nine real-world datasets and multiple backbones show that ADO consistently improves performance, and that combining ADO with existing prompt-engineering techniques (CoT, ICL, PE2) yields further gains, highlighting its practical value for augmenting LLM inference with optimized input data.
Abstract
This study explores a novel approach to enhance the performance of Large Language Models (LLMs) through the optimization of input data within prompts. While previous research has primarily focused on refining instruction components and augmenting input data with in-context examples, our work investigates the potential benefits of optimizing the input data itself. We introduce a two-pronged strategy for input data optimization: content engineering and structural reformulation. Content engineering involves imputing missing values, removing irrelevant attributes, and enriching profiles by generating additional information inferred from existing attributes. After content engineering, structural reformulation is applied to optimize how the modified content is presented to LLMs, since they are sensitive to input format. Our findings suggest that these optimizations can significantly improve the performance of LLMs in various tasks, offering a promising avenue for future research in prompt engineering. The source code is available at https://anonymous.4open.science/r/ADO-6BC5/
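The two-stage composition $D' = f_{\text{format}}(f_{\text{content}}(D))$ can be illustrated with a minimal sketch. This is not the authors' implementation: in ADO both steps are carried out by an LLM driven by searched prompts, whereas here simple hand-written rules stand in for them. The function names, the `defaults` table, the example record, and the age-group enrichment rule are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set

Record = Dict[str, Optional[str]]


def f_content(record: Record, relevant: Set[str], defaults: Dict[str, str]) -> Dict[str, str]:
    """Content step (rule-based stand-in): filter, impute, enrich."""
    out: Dict[str, str] = {}
    # Attribute filtering: only keys listed in `relevant` are carried over.
    for key in sorted(relevant):
        value = record.get(key)
        # Imputation: replace a missing value with a default.
        out[key] = value if value is not None else defaults.get(key, "unknown")
    # Enrichment (hypothetical rule): derive a coarse age group from age.
    if out.get("age", "").isdigit():
        out["age_group"] = "65+" if int(out["age"]) >= 65 else "under 65"
    return out


def f_format(record: Dict[str, str]) -> str:
    """Format step (stand-in): present the record as 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in record.items())


def optimize(record: Record, relevant: Set[str], defaults: Dict[str, str]) -> str:
    """Compose the two steps: D' = f_format(f_content(D))."""
    return f_format(f_content(record, relevant, defaults))


# Hypothetical input record with a missing value and an irrelevant attribute.
raw = {"age": "70", "city": None, "session_id": "x9f"}
optimized = optimize(raw, relevant={"age", "city"}, defaults={"city": "unknown"})
print(optimized)
```

In this sketch the irrelevant `session_id` attribute is dropped, the missing `city` is imputed, and an `age_group` attribute is added before the record is serialized. Replacing either function with an LLM call would leave the composition in `optimize` unchanged.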
