S2LPP: Small-to-Large Prompt Prediction across LLMs

Liang Cheng, Tianyi LI, Zhaowei Wang, Mark Steedman

TL;DR

This work addresses the high cost and sensitivity of prompt engineering for large language models by showing that optimal prompts are consistent across model sizes within the same family and, to some extent, across families. It introduces Small-to-Large Prompt Prediction (S2LPP), a three-step framework that uses small LLMs to generate and select high-performing prompts for a larger target model, substantially reducing computation while achieving near-oracle performance on open-domain QA and natural language inference across 14 LLMs. The method is validated on QA and NLI, with extensions to retrieval-augmented generation and arithmetic reasoning that illustrate its robustness and generalizability to broader NLP tasks. The results indicate that exploiting prompt consistency can substantially cut prompt-engineering costs while maintaining high performance, which is of practical benefit when deploying diverse LLMs in real-world settings.

Abstract

The performance of pre-trained Large Language Models (LLMs) is often sensitive to nuances in prompt templates, requiring careful prompt engineering that adds cost in both compute and human effort. In this study, we present experiments on multiple LLM variants of varying sizes, aimed at probing their preferences among different prompts. Through experiments on Question Answering, we show that prompt preference is consistent across LLMs of different sizes. We also show that this consistency extends to other tasks, such as Natural Language Inference. Utilizing this consistency, we propose a method that uses a smaller model to select effective prompt templates for a larger model. We show that our method substantially reduces the cost of prompt engineering while consistently matching the performance of the optimal prompt among the candidates. More importantly, our experiments show the efficacy of our strategy across fourteen LLMs and its applicability to a broad range of NLP tasks, highlighting its robustness.
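
The small-to-large idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`accuracy`, `select_prompt`, `s2lpp`), the `{subject}` placeholder, the toy models, and the choice to score candidates by accuracy on a small labeled set are all assumptions made for illustration. In practice each "model" would be a call to an actual LLM.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a "model" is any callable mapping a prompt string to an answer string.
Model = Callable[[str], str]


def accuracy(model: Model, template: str, examples: Sequence[Tuple[str, str]]) -> float:
    """Fraction of (subject, gold) pairs answered correctly with this template."""
    hits = sum(
        gold.lower() in model(template.format(subject=subject)).lower()
        for subject, gold in examples
    )
    return hits / len(examples)


def select_prompt(selection_model: Model, candidates: Sequence[str],
                  dev_examples: Sequence[Tuple[str, str]]) -> str:
    """Pick the candidate the small selection model scores best (first one wins ties)."""
    return max(candidates, key=lambda t: accuracy(selection_model, t, dev_examples))


def s2lpp(generate: Callable[[str, int], List[str]], selection_model: Model,
          target_model: Model, relation: str,
          dev_examples: Sequence[Tuple[str, str]], subjects: Sequence[str],
          k: int = 5) -> Tuple[str, List[str]]:
    candidates = generate(relation, k)                                  # Step 1: generate top-k candidates
    best = select_prompt(selection_model, candidates, dev_examples)    # Step 2: select with the small model
    answers = [target_model(best.format(subject=s)) for s in subjects]  # Step 3: query the target model
    return best, answers


# Hypothetical stand-ins for real LLM calls, so the sketch runs offline.
FACTS = {"Ada Lovelace": "London", "Alan Turing": "London"}


def toy_generator(relation: str, k: int) -> List[str]:
    templates = [
        "{subject} place of birth?",
        "Where was {subject} born?",
        "Birthplace: {subject}",
    ]
    return templates[:k]


def toy_model(prompt: str) -> str:
    # Answers only when the prompt is phrased as a "Where was ... born?" question.
    if prompt.startswith("Where was"):
        for name, place in FACTS.items():
            if name in prompt:
                return place
    return "unknown"
```

Calling `s2lpp(toy_generator, toy_model, toy_model, "place_of_birth", [("Ada Lovelace", "London")], ["Alan Turing"], k=3)` selects the question-style template using only the selection model, so the target model is queried with a single prompt instead of all candidates. That is where the cost saving would come from.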

Paper Structure

This paper contains 47 sections, 1 equation, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Accuracy of different prompts across LLaMA-3 and DeepSeek-R1 models on Google-RE. The x-axis represents the various prompts being evaluated. The solid bar indicates the optimal prompt for each LLM.
  • Figure 2: Accuracy of different prompts across LLaMA-3 and DeepSeek models of varying sizes on the directional Levy/Holt dataset (an NLI task). The x-axis represents the various candidate prompts, while the solid bar represents the optimal prompt for each LLM.
  • Figure 3: The workflow of S2LPP on open-domain QA. Step 1: For each relation, we use the prompt-generation model to produce the top-k candidate prompts. Step 2: We employ the small Selection Model to identify the optimal prompt among the candidates. Step 3: We use the selected prompt to pose questions and employ the Target Model to answer them.
  • Figure 4: The Recovery Rate of Performance (RRoP) across various LLMs on QA tasks. RRoP scores exceeding 70% are highlighted in red.
  • Figure 5: Accuracy of different models in the prompt selection step for QA. The green column represents the baseline using the first-generated prompt, while the red column illustrates the accuracy with the oracle prompt, which is the upper bound of the target model (GPT-3.5).
  • ...and 1 more figure