APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking

Can Jin; Hongwu Peng; Shiyu Zhao; Zhenting Wang; Wujiang Xu; Ligong Han; Jiahui Zhao; Kai Zhong; Sanguthevar Rajasekaran; Dimitris N. Metaxas

TL;DR

The paper tackles the bottleneck of manual prompt engineering in zero-shot LLM reranking for information retrieval. It introduces APEER, an automatic prompt engineering framework that iteratively improves prompts through Feedback Optimization and Preference Optimization. Prompts are optimized on MS MARCO and evaluated with GPT-4, GPT-3.5, LLaMA3, and Qwen2 on BEIR and TREC benchmarks. Empirical results show consistent gains over state-of-the-art manual prompts and strong transferability across datasets and models, and ablations confirm the value of preference learning. The approach reduces human effort in prompt design while delivering robust cross-domain reranking performance suitable for real-world IR systems.

Abstract

Large Language Models (LLMs) have significantly enhanced Information Retrieval (IR) across various modules, such as reranking. Despite impressive performance, current zero-shot relevance ranking with LLMs heavily relies on human prompt engineering. Existing automatic prompt engineering algorithms primarily focus on language modeling and classification tasks, leaving the domain of IR, particularly reranking, underexplored. Directly applying current prompt engineering algorithms to relevance ranking is challenging due to the integration of query and long passage pairs in the input, where the ranking complexity surpasses classification tasks. To reduce human effort and unlock the potential of prompt optimization in reranking, we introduce a novel automatic prompt engineering algorithm named APEER. APEER iteratively generates refined prompts through feedback and preference optimization. Extensive experiments with four LLMs and ten datasets demonstrate the substantial performance improvement of APEER over existing state-of-the-art (SoTA) manual prompts. Furthermore, we find that the prompts generated by APEER exhibit better transferability across diverse tasks and LLMs.

Paper Structure (29 sections, 6 equations, 4 figures, 5 tables, 1 algorithm)

Introduction
Related Works
Prompt Engineer
LLMs for Information Retrieval
Method
Problem Formulation
Build Training Dataset
Prompt Initialization
Feedback Optimization
Preference Optimization
Experiments
Implementation Details
Models.
Benchmarks.
Baselines.
...and 14 more sections

Figures (4)

Figure 1: Performance overview of four prompting methods on the GPT4, LLaMA3 [llama3modelcard], and Qwen2 [qwen2] models and the BEIR datasets [thakur2021beir]. The manual prompt is RankGPT [sun2023chatgpt]. Modifying the manual prompt with CoT and paraphrasing yields marginal gains.
Figure 2: Overview of APEER. APEER iteratively refines prompts through two optimization steps. In Feedback Optimization, it refines the current prompt $p$ and creates a refined prompt $p'$ based on feedback. In Preference Optimization, it further optimizes $p'$ by learning preferences from a set of positive and negative prompt demonstrations.
Figure 3: Ablation results for training dataset size. We train with the LLaMA3 model on training datasets of various sizes and evaluate on TREC-DL19 and TREC-DL20.
Figure 4: Illustration of APEER training responses. In Feedback Optimization, the LLM provides feedback on the original prompt and refines it based on that feedback. In Preference Optimization, the LLM mutates the refined prompt towards the positive prompt.
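
The two steps described in the captions of Figures 2 and 4 can be sketched roughly as the loop below. This is a minimal illustration, not the authors' implementation: the function name `apeer_sketch`, the callables `llm` and `score`, the instruction wording, the way positive and negative prompts are picked (best and worst scored so far), and the rule of keeping the best-scoring prompt are all assumptions made for this example.

```python
from typing import Callable, List, Tuple


def apeer_sketch(
    initial_prompt: str,
    llm: Callable[[str], str],      # hypothetical: instruction text -> LLM response
    score: Callable[[str], float],  # hypothetical: prompt -> ranking quality on training data
    steps: int = 3,
) -> str:
    """Iteratively refine a reranking prompt with feedback and preference steps."""
    current = initial_prompt
    # History of (score, prompt) pairs; used to pick positive/negative demonstrations.
    history: List[Tuple[float, str]] = [(score(current), current)]

    for _ in range(steps):
        # Feedback Optimization: ask for feedback on p, then rewrite p into p'.
        feedback = llm(f"Give feedback on this reranking prompt:\n{current}")
        refined = llm(
            "Rewrite the prompt using the feedback.\n"
            f"Prompt:\n{current}\nFeedback:\n{feedback}"
        )

        # Preference Optimization: show a positive and a negative prompt
        # (assumed here to be the best and worst scored so far) and ask the
        # LLM to revise p' towards the positive one.
        ranked = sorted(history, key=lambda pair: pair[0], reverse=True)
        positive, negative = ranked[0][1], ranked[-1][1]
        candidate = llm(
            f"Positive prompt:\n{positive}\nNegative prompt:\n{negative}\n"
            f"Revise this prompt towards the positive one:\n{refined}"
        )

        # Score both new prompts and continue from the best prompt seen so far.
        history.append((score(refined), refined))
        history.append((score(candidate), candidate))
        current = max(history, key=lambda pair: pair[0])[1]

    return current
```

In practice `llm` would wrap a call to a model such as those named above, and `score` would run the reranker with the candidate prompt on held-out training queries and return a ranking metric; both are left abstract here because the page does not specify them.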
