CriSPO: Multi-Aspect Critique-Suggestion-guided Automatic Prompt Optimization for Text Generation

Han He; Qianchu Liu; Lei Xu; Chaitanya Shivade; Yi Zhang; Sundararajan Srinivasan; Katrin Kirchhoff

TL;DR

CriSPO is a multi-aspect Critique-Suggestion-guided automatic Prompt Optimization framework for text generation. Its critique-suggestion module discovers evaluation aspects on its own and turns them into actionable feedback; a receptive optimizer then combines past critiques with chain-of-thought reasoning to generate better prompts. An Automatic Suffix Tuning (AST) extension enables multi-metric optimization by attaching a tunable suffix to the prompt, improving metrics such as AlignScore while preserving ROUGE performance. Evaluated on four LLMs across four summarization and five QA datasets, CriSPO improves ROUGE by 3-4% on summarization and yields substantial gains on QA, with supportive human evaluation; it also generalizes to additional tasks and explores more diverse prompts than prior methods.

Abstract

Existing automatic prompt engineering methods are typically designed for discriminative tasks, where new task prompts are iteratively refined with limited feedback from a single metric reflecting a single aspect. However, these approaches are suboptimal for generative tasks, which require more nuanced guidance beyond a single numeric metric to improve the prompt and optimize multiple aspects of the generated text. To address these challenges, we propose a novel multi-aspect Critique-Suggestion-guided automatic Prompt Optimization (CriSPO) approach. CriSPO introduces a critique-suggestion module as its core component. This module spontaneously discovers aspects, and compares generated and reference texts across these aspects, providing specific suggestions for prompt modification. These clear critiques and actionable suggestions guide a receptive optimizer module to make more substantial changes, exploring a broader and more effective search space. To further improve CriSPO with multi-metric optimization, we introduce an Automatic Suffix Tuning (AST) extension to enhance the performance of task prompts across multiple metrics. We evaluate CriSPO on 4 state-of-the-art LLMs across 4 summarization and 5 QA datasets. Extensive experiments show 3-4% ROUGE score improvement on summarization and substantial improvement of various metrics on QA. Code available at https://github.com/amazon-science/crispo
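The abstract gives no detail on how AST tunes the suffix, so the following is only a minimal sketch of the multi-metric idea: keep the task prompt fixed, try candidate suffixes, and accept one only if a secondary metric improves without the primary metric dropping. The function `tune_suffix`, the `tolerance` parameter, the fixed candidate list, and the two scoring callables are all hypothetical; this simple filter-and-select is a stand-in, not the paper's procedure.

```python
from typing import Callable, Iterable


def tune_suffix(
    base_prompt: str,
    candidate_suffixes: Iterable[str],
    primary: Callable[[str], float],    # hypothetical scorer, e.g. ROUGE of a prompt on a dev set
    secondary: Callable[[str], float],  # hypothetical scorer, e.g. AlignScore of a prompt on a dev set
    tolerance: float = 0.0,             # assumed: allowed drop in the primary metric
) -> str:
    """Return the suffix that most improves `secondary` while keeping
    `primary` within `tolerance` of the base prompt ("" if none qualifies)."""
    base_primary = primary(base_prompt)
    best_suffix = ""
    best_secondary = secondary(base_prompt)
    for suffix in candidate_suffixes:
        full_prompt = base_prompt + "\n" + suffix
        # Reject suffixes that hurt the primary metric beyond the tolerance.
        if primary(full_prompt) < base_primary - tolerance:
            continue
        s = secondary(full_prompt)
        if s > best_secondary:
            best_suffix, best_secondary = suffix, s
    return best_suffix
```

In practice the candidate suffixes would themselves come from an optimizer LLM rather than a fixed list; the list here only keeps the sketch self-contained.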

Paper Structure (66 sections, 2 equations, 4 figures, 17 tables)


Introduction
Related Work
Method
Multi-Aspect Critiques and Suggestions
Constructive critiques with spontaneous dimension discovery:
Multi-aspect suggestions:
Receptive Prompt Optimizer
Enriched optimization trajectory:
Chain-of-thought:
Flexible task prompt template:
Multi-Metric Automatic Suffix Tuning
Main Experiments
Experiment Setup
Datasets
LLM and Baselines
...and 51 more sections

Figures (4)

Figure 1: The CriSPO workflow for text generation tasks. In each iteration, a candidate task prompt $p_t$ is applied to $\mathcal{D}_{\texttt{trn}}$ (step 1) and evaluated using a multi-aspect critique-suggestion meta-prompt $M_c$ (step 2). We select top-$K$ previously sampled task prompts (step 3) and use a receptive optimizer $M_o$ to generate the next candidate $p_{t+1}$ (step 4). The automatic optimization loop runs multiple iterations, while the best task prompt is selected based on performance on $\mathcal{D}_{\texttt{dev}}$.
Figure 2: A word cloud showing the different aspects identified by CriSPO when comparing generations and references.
Figure 3: Ablation studies with Claude Instant on SAMSum. For each setup, we report the mean ROUGE1 F1 and standard deviation across three runs. For (c), we change the random seed to select different samples across the three runs.
Figure 4: Visualization of prompt diversity on 4 summarization datasets for OPRO in red $\bullet$ and CriSPO in blue $\times$.
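The four steps in the Figure 1 caption can be sketched as a short loop. This is an illustrative outline under stated assumptions, not the released implementation: `generate`, `score`, `critique`, and `optimize` are hypothetical callables standing in for the task LLM, the training metric, the critique-suggestion meta-prompt $M_c$, and the receptive optimizer $M_o$, and the default values of `iterations` and `top_k` are arbitrary.

```python
from typing import Callable, List, Tuple

# One history entry: (task prompt, training score, critique text).
Record = Tuple[str, float, str]


def crispo_loop(
    initial_prompt: str,
    train_inputs: List[str],
    train_refs: List[str],
    generate: Callable[[str, str], str],                   # hypothetical task LLM call
    score: Callable[[List[str], List[str]], float],        # hypothetical metric, e.g. ROUGE
    critique: Callable[[str, List[str], List[str]], str],  # stand-in for M_c
    optimize: Callable[[List[Record]], str],               # stand-in for M_o
    iterations: int = 3,
    top_k: int = 2,
) -> List[Record]:
    """Run the optimization loop and return every sampled prompt with its score and critique."""
    history: List[Record] = []
    prompt = initial_prompt
    for _ in range(iterations):
        # Step 1: apply the candidate prompt to the training set.
        outputs = [generate(prompt, x) for x in train_inputs]
        # Step 2: score the outputs and collect a multi-aspect critique.
        history.append((prompt, score(outputs, train_refs),
                        critique(prompt, outputs, train_refs)))
        # Step 3: keep the top-K previously sampled prompts by training score.
        top = sorted(history, key=lambda r: r[1], reverse=True)[:top_k]
        # Step 4: ask the optimizer for the next candidate.
        prompt = optimize(top)
    return history


def select_best(history: List[Record], dev_score: Callable[[str], float]) -> str:
    """Pick the final prompt by its score on the development set."""
    return max((r[0] for r in history), key=dev_score)
```

Passing the critiques alongside the scores in `top` is what lets the optimizer see why a prompt scored as it did, rather than only how well.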
