Improving Minimum Bayes Risk Decoding with Multi-Prompt

David Heineman, Yao Dou, Wei Xu

TL;DR

This work shows that multi-prompt decoding improves MBR across a comprehensive set of conditional generation tasks, and that the gain comes from estimating a more diverse and higher-quality candidate space than a single prompt provides.

Abstract

While instruction fine-tuned LLMs are effective text generators, their sensitivity to prompt construction makes performance unstable and sub-optimal in practice. Relying on a single "best" prompt cannot capture all the differing approaches to a generation problem. Building on this observation, we propose multi-prompt decoding, where many candidate generations are decoded from a prompt bank at inference time. To ensemble candidates, we use Minimum Bayes Risk (MBR) decoding, which selects a final output using a trained value metric. We show that multi-prompt improves MBR across a comprehensive set of conditional generation tasks, and that this results from estimating a more diverse and higher-quality candidate space than that of a single prompt. Further experiments confirm that multi-prompt improves generation across tasks, models, and metrics.
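
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `token_f1`, `mbr_select`, `multi_prompt_mbr`, and the `generate` callback are hypothetical, the unigram-overlap F1 is a toy stand-in for the trained value metric, and drawing exactly one candidate per prompt is a simplifying assumption.

```python
from typing import Callable, List


def token_f1(hypothesis: str, reference: str) -> float:
    """Toy utility: F1 over unigram sets (a stand-in for a trained metric)."""
    hyp = set(hypothesis.lower().split())
    ref = set(reference.lower().split())
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: List[str],
               utility: Callable[[str, str], float] = token_f1) -> str:
    """Return the candidate with the highest mean utility against all others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i: int) -> float:
        # Every other candidate acts as a pseudo-reference for candidate i.
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


def multi_prompt_mbr(prompt_bank: List[str],
                     source: str,
                     generate: Callable[[str, str], str],
                     utility: Callable[[str, str], float] = token_f1) -> str:
    """Decode one candidate per prompt (assumption), then pick one with MBR."""
    candidates = [generate(prompt, source) for prompt in prompt_bank]
    return mbr_select(candidates, utility)
```

In this sketch `generate(prompt, source)` is whatever function calls the model; swapping `token_f1` for a learned metric changes only the `utility` argument, since the selection logic depends on nothing but pairwise scores.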

Paper Structure (35 sections, 7 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Multi-prompt and single prompt MBR results for code generation on HumanEval, text simplification on SimpEval, and translation on WMT '22 En-Cs generated with open-source 7B LLMs (details in the experiments section).
  • Figure 2: Multi-prompt MBR generates candidates using a human- or model-written prompt bank and selects the highest pairwise score with a trained value metric.
  • Figure 3: (a) LENS score and sequence probability for 1000 generations on a single text simplification example decoded from Llama 2 7B Chat with temperatures $\tau=\left[0, 0.1, 0.5\right]$ using a single prompt (top) and multiple prompts (bottom). As the temperature increases, we find each prompt estimates candidate sequences centered at different modes. (b) LENS scores of the best generation per prompt for the first 20 sentences in SimpEval, showing no single prompt produces the best overall output. (c) Dataset-level LENS performance of each prompt when performing single prompt MBR vs. multi-prompt MBR.
  • Figure 4: Candidate set diversity and LENS scores on SimpEval for 200 repetitions of single-prompt and multi-prompt at various temperatures. At low temperatures, the increased candidate diversity from multi-prompt directly translates to improved performance.
  • Figure 5: $\Delta$ metric improvement from single prompt to multi-prompt across model sizes and architectures, reported with a 95% CI bootstrapped over 20 iterations. For absolute performance, see the detailed multi-prompt results figure.
  • ...and 5 more figures