Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack

Silvia Cappelletti, Tobia Poppi, Samuele Poppi, Zheng-Xin Yong, Diego Garcia-Olano, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR

This paper addresses the fragility of first-token probability (FTP) in MCQA by introducing a prefilling attack: a structured prefix (e.g., 'The correct option is:') prepended to the model output so that the first generated token is a valid option. The method preserves the efficiency of FTP while reducing first-token misalignment and misinterpretation, and it improves accuracy, calibration, and output consistency across diverse models and benchmarks. Its performance is close to that of open-ended generation with external classifiers at a fraction of the cost, which makes FTP-based evaluation more reliable for MCQA. The findings suggest that prefilling is a robust, low-cost technique for improving symbolic decoding in LLM-driven QA tasks, in both research and deployment.

Abstract

Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks using *first-token probability* (FTP), which selects the answer option whose initial token has the highest likelihood. While efficient, FTP can be fragile: models may assign high probability to unrelated tokens (*misalignment*) or use a valid token merely as part of a generic preamble rather than as a clear answer choice (*misinterpretation*), undermining the reliability of symbolic evaluation. We propose a simple solution: the *prefilling attack*, a structured natural-language prefix (e.g., "*The correct option is:*") prepended to the model output. We repurpose prefilling, a technique originally explored in AI safety, to steer the model to respond with a clean, valid option without modifying its parameters. Empirically, FTP with prefilling substantially improves accuracy, calibration, and output consistency across a broad set of LLMs and MCQA benchmarks. It outperforms standard FTP and often matches the performance of open-ended generation approaches that require full decoding and external classifiers, while being significantly more efficient. Our findings suggest that prefilling is a simple, robust, and low-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.
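
The FTP rule and the misalignment failure described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the plain-text `build_input` template, and the toy logit values are all assumptions made for illustration, and a real evaluation would read the next-token logits from an actual model.

```python
import math

OPTIONS = ("A", "B", "C", "D")
PREFILL = "The correct option is:"  # example prefix quoted in the abstract


def softmax(logits):
    """Turn a {token: logit} mapping into a {token: probability} mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def ftp_predict(next_token_logits, options=OPTIONS):
    """Pick the option whose token is most likely as the first output token.

    Returns (choice, probability of that choice, misaligned), where
    `misaligned` is True when the single most likely token is not an option.
    """
    probs = softmax(next_token_logits)
    top_token = max(probs, key=probs.get)
    option_probs = {opt: probs.get(opt, 0.0) for opt in options}
    choice = max(option_probs, key=option_probs.get)
    return choice, option_probs[choice], top_token not in options


def build_input(question_prompt, prefill=None):
    """Hypothetical plain-text template; real chat templates differ per model."""
    text = question_prompt + "\nAnswer:"
    if prefill:
        text += " " + prefill  # the model continues right after the prefix
    return text


# Toy next-token logits (made-up numbers, not measurements from any model).
logits_plain = {"The": 3.0, "Sure": 2.0, "A": 1.0, "B": 1.2, "C": 0.5, "D": 0.1}
logits_prefilled = {"B": 4.0, "A": 1.0, "C": 0.5, "D": 0.2, "The": -1.0}

plain = ftp_predict(logits_plain)
prefilled = ftp_predict(logits_prefilled)
print(build_input("Q: ...? (A) ... (B) ... (C) ... (D) ...", PREFILL))
print("standard FTP:", plain)
print("FTP with prefilling:", prefilled)
```

In the toy numbers, standard FTP still returns an option, but most of the probability mass sits on preamble tokens, so the selected option carries little confidence. With the prefix in place, the most likely first token is itself an option.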

Paper Structure

This paper contains 17 sections, 4 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: We show that a simple prefilling attack template, which directs an LLM's first generated token to a valid option for MCQA, substantially improves the standard first-token probability approach.
  • Figure 2: Visual examples of our prefilling strategy.
  • Figure 3: Comparison of model accuracy on OpenBookQA, Social IQa, and SciQ using standard FTP, FTP with prefilling, and open-ended generation with GPT-3.5-Turbo or Llama-3.1-70B as classifiers. FTP with prefilling consistently outperforms baseline FTP and often surpasses the more expensive open-ended generation approaches.
  • Figure 4: Calibration curves comparing standard FTP and FTP with prefilling. Prefilling improves calibration, bringing predictions closer to the ideal confidence-accuracy alignment.
  • Figure 5: Accuracy comparison on all the benchmarks using standard FTP, FTP with prefilling, and open-ended generation with Llama-3.1-70B as classifier. Again, FTP with prefilling consistently outperforms standard FTP and often outperforms or is on par with the more computationally expensive open-ended generation approach.
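
As a rough illustration of what a calibration curve such as the one in Figure 4 measures, the sketch below bins predictions by confidence and compares mean confidence with accuracy in each bin. The equal-width binning, the function names, and the expected calibration error summary are assumptions for illustration; the summary above does not state how the paper computes its curves.

```python
def calibration_curve(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins.

    Returns a list of (mean confidence, accuracy, count) for non-empty bins.
    A well-calibrated model has mean confidence close to accuracy in each bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    curve = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            curve.append((mean_conf, accuracy, len(members)))
    return curve


def expected_calibration_error(curve):
    """Count-weighted average gap between confidence and accuracy."""
    total = sum(n for _, _, n in curve)
    return sum(n / total * abs(conf - acc) for conf, acc, n in curve)


# Made-up confidences and correctness flags, for illustration only.
toy_conf = [0.9, 0.9, 0.1, 0.1]
toy_correct = [True, True, False, False]
toy_curve = calibration_curve(toy_conf, toy_correct, n_bins=2)
print(toy_curve, expected_calibration_error(toy_curve))
```

Here the confidence of each prediction would be the probability FTP assigns to the selected option, with or without the prefix.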