MAPO: Boosting Large Language Model Performance with Model-Adaptive Prompt Optimization

Yuyan Chen, Zhihao Wen, Ge Fan, Zhengyu Chen, Wei Wu, Dayiheng Liu, Zhixu Li, Bang Liu, Yanghua Xiao

TL;DR

This paper shows that prompts are not universally optimal across LLMs and proposes MAPO, a framework that tailors prompts to each model to improve NLP performance. MAPO combines a warm-up dataset, supervised fine-tuning, a learned reward model, and reinforcement learning (PPO with model feedback) to generate model-adaptive prompts $P_o$ from original prompts $P$. In experiments on BLOOM-7B, GPT-J-6B, and LLaMA-7B across QA, classification, and generation tasks, MAPO yields robust improvements and notable domain transfer, while ablation studies highlight the value of RL and the importance of maintaining generalization. MAPO does require substantial warm-up data and computational resources, which points future work toward more efficient training and broader applicability across languages and tasks.

Abstract

Prompt engineering, an efficient and effective way to leverage Large Language Models (LLMs), has drawn considerable attention from the research community. Existing research primarily emphasizes adapting prompts to specific tasks rather than to specific LLMs. However, a good prompt is defined not solely by its wording but also by the nature of the LLM in question. In this work, we first quantitatively demonstrate that different prompts should be adapted to different LLMs to enhance their capabilities across various downstream NLP tasks. We then propose a novel model-adaptive prompt optimizer (MAPO) that optimizes the original prompts for each specific LLM in downstream tasks. Extensive experiments indicate that the proposed method can effectively refine prompts for an LLM, leading to significant improvements across various downstream tasks.
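
The core observation, that the best prompt depends on the model, can be illustrated with a minimal sketch. This is not the MAPO training procedure: the function `select_best_prompts`, the model names, and the toy scoring functions are hypothetical stand-ins for evaluating a real LLM on a downstream task.

```python
from typing import Callable, Dict, List

# Hypothetical scorer type: maps a prompt to a quality score for one model.
Scorer = Callable[[str], float]


def select_best_prompts(
    candidates: List[str], scorers: Dict[str, Scorer]
) -> Dict[str, str]:
    """For each model, return the candidate prompt with the highest score."""
    return {name: max(candidates, key=score) for name, score in scorers.items()}


# Toy stand-ins for real LLM evaluations (assumptions, not from the paper).
toy_scorers: Dict[str, Scorer] = {
    "model_a": lambda p: -len(p),          # this toy model prefers short prompts
    "model_b": lambda p: p.count("step"),  # this one prefers "step" cues
}

candidates = [
    "Answer the question.",
    "Think step by step, then answer the question.",
]

best = select_best_prompts(candidates, toy_scorers)
for model_name, prompt in best.items():
    print(f"{model_name}: {prompt}")
```

With these toy scorers, the two models select different candidates from the same pool, which is the situation that motivates optimizing prompts per model rather than per task alone.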

Paper Structure

This paper contains 24 sections, 10 equations, 9 figures, and 19 tables.

Figures (9)

  • Figure 1: Variance in answers from different LLMs (b) when they are given the same task-specific prompts (a).
  • Figure 2: The performance of different LLMs on task-specific prompts for three tasks: question-answering (a), classification (b), and generation (c). The results reveal significant variation in performance across LLMs.
  • Figure 3: Framework of the proposed MAPO, including warm-up dataset establishment and prompt optimizer construction.
  • Figure 4: The performance of the reward model for three LLMs during the training process of MAPO.
  • Figure 5: Performance with different proportions of the warm-up dataset on various downstream tasks for three LLMs. Q: QA, C: classification, G: generation. Only the first four letters of each dataset's name are kept in the figure.
  • ...and 4 more figures