Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions

Sarah Wiegreffe, Oyvind Tafjord, Yonatan Belinkov, Hannaneh Hajishirzi, Ashish Sabharwal

TL;DR

The paper investigates how transformer LMs answer formatted MCQA by combining activation patching and vocabulary projection to trace where and how the correct answer symbol is produced. It reveals a sparse, attention-driven mechanism operating primarily in a small set of middle layers (notably around layer 24) that selects and boosts the target symbol, with later layers propagating this signal in vocabulary space. The authors show this behavior is largely consistent across model families (Olmo, Llama, Qwen) and datasets, though exact layer indices and the degree of difficulty vary; they also present a synthetic Colors task to disentangle dataset-specific knowledge from formatted-MCQA ability and demonstrate that poorly performing models struggle to separate answer symbols in vocabulary space. Overall, the work advances mechanistic understanding of symbol binding in MCQA, highlighting the role of sparse MHSA heads and cross-layer refinement, and offers practical implications for evaluating and improving model reliability under varied MCQA formats.

Abstract

Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have quite a range of performance, particularly when the task format is diversified slightly (such as by shuffling answer choice order). In this work we ask: how do successful models perform formatted MCQA? We employ vocabulary projection and activation patching methods to localize key hidden states that encode relevant information for predicting the correct answer. We find that the prediction of a specific answer symbol is causally attributed to a few middle layers, and specifically their multi-head self-attention mechanisms. We show that subsequent layers increase the probability of the predicted answer symbol in vocabulary space, and that this probability increase is associated with a sparse set of attention heads with unique roles. We additionally uncover differences in how different models adjust to alternative symbols. Finally, we demonstrate that a synthetic task can disentangle sources of model error to pinpoint when a model has learned formatted MCQA, and show that logit differences between answer choice tokens continue to grow over the course of training.
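
The two methods named in the abstract can be illustrated on a toy model. The sketch below is not the authors' code and uses no real LM: the residual stack, the fixed `context` vector (a crude stand-in for what attention reads from earlier positions), the sizes, and all function names are hypothetical. Vocabulary projection here multiplies each layer's last-position hidden state by the unembedding matrix, omitting the final normalization a real model would apply. Activation patching overwrites one layer's output in run A with the corresponding state from run B and measures the change in the final logit of a chosen token.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_LAYERS, VOCAB = 16, 6, 4  # hypothetical toy sizes; VOCAB stands for four answer symbols

# Toy residual stack (an assumption for illustration, not a real LM).
W = rng.normal(scale=0.5, size=(N_LAYERS, D_MODEL, D_MODEL))
U = rng.normal(scale=0.5, size=(N_LAYERS, D_MODEL, D_MODEL))
UNEMBED = rng.normal(size=(D_MODEL, VOCAB))


def run(last_tok, context, patch=None):
    """Return the last-position output hidden state of every layer.

    patch: optional (layer_index, hidden_state) that overwrites that layer's output.
    """
    h, states = last_tok, []
    for i in range(N_LAYERS):
        h = h + np.tanh(h @ W[i] + context @ U[i])  # residual update
        if patch is not None and patch[0] == i:
            h = patch[1]  # activation patching: swap in the state from another run
        states.append(h)
    return np.stack(states)  # shape (N_LAYERS, D_MODEL)


def project(states):
    """Vocabulary projection: hidden state(s) -> logits over the toy vocabulary."""
    return states @ UNEMBED


def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def patching_effect(inputs_a, states_b, token):
    """Change in the final logit of `token` when layer i of run A is patched from run B."""
    clean = project(run(*inputs_a)[-1])[token]
    effects = []
    for i in range(N_LAYERS):
        patched = project(run(*inputs_a, patch=(i, states_b[i]))[-1])[token]
        effects.append(patched - clean)
    return np.array(effects)


# Two random "prompts", each a (last-token embedding, context vector) pair.
inputs_a = (rng.normal(size=D_MODEL), rng.normal(size=D_MODEL))
inputs_b = (rng.normal(size=D_MODEL), rng.normal(size=D_MODEL))
states_a, states_b = run(*inputs_a), run(*inputs_b)

layer_logits = project(states_a)     # (N_LAYERS, VOCAB): projected logits per layer
layer_probs = softmax(layer_logits)  # the same states as per-layer probabilities
effects = patching_effect(inputs_a, states_b, token=1)

print(np.round(layer_probs, 3))
print(np.round(effects, 3))
```

Because only the last-position state is replaced while `context` still comes from run A, the effect differs by layer. Patching the final layer reproduces run B's output exactly, which serves as a quick sanity check on the patching logic.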

Paper Structure

This paper contains 31 sections, 3 equations, and 28 figures.

Figures (28)

  • Figure 1: We investigate the ability of transformer LMs to answer formatted multiple-choice questions, which involves producing an answer choice symbol (here, A or B). We discover 1-3 middle layers at the last token position, and particularly their multi-head self-attention functions, responsible for answer selection. Later layers assign increasing probability to the symbol of interest in the model's vocabulary space, for which a sparse set of attention heads are responsible. Finally, when the prompt contains unusual answer choice symbols such as Q/Z/R/X, some models initially assign high values to common answer symbols like A/B/C/D before aligning to the symbols in the prompt at a late layer.
  • Figure 2: Results by model on Colors, HellaSwag and MMLU. Plotted is the minimum accuracy across A/B/C/D, Q/Z/R/X, and 1/2/3/4 prompts, where the accuracy for each prompt is taken as the average over all four correct answer positions. 0-shot results for select models are in \ref{fig:0-shot-perf}.
  • Figure 3: Average effect (top: logits; bottom: probits) of patching individual output hidden states for Olmo 0724 7B Instruct ($x_B \rightarrow x_A$) on predictions correct under both prompts. Patterns are largely similar regardless of which position is used for replacement and the direction of replacement. See \ref{fig:ct_bacd_llama_qwen} for additional results.
  • Figure 4: Average projected logits (top) and probits (bottom) of answer tokens at each layer for Olmo 0724 7B Instruct, for correct 3-shot predictions with the prompt A/B/C/D. See \ref{fig:across_tasks_llama} for Llama 3.1 8B Instruct and \ref{fig:across_tasks_qwen} for Qwen 2.5 1.5B Instruct. See \ref{fig:across_tasks_0shot} for 0-shot results.
  • Figure 5: Average projected logits (top) and probits (bottom) of answer tokens at each layer for correct 3-shot predictions by Olmo 0724 7B Instruct on HellaSwag, with the prompt A/B/C/D. See \ref{fig:b_a_c_d_llama2} for Llama 3.1 8B Instruct, \ref{fig:b_a_c_d_qwen} for Qwen 2.5 1.5B Instruct, and \ref{fig:b_a_c_d_0shot} for 0-shot results.
  • ...and 23 more figures
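
The aggregation described in the Figure 2 caption (average accuracy over the four correct-answer positions for each prompt format, then the minimum across formats) can be sketched as follows. The function name and the accuracy values are made up for illustration and are not results from the paper.

```python
from statistics import mean


def min_accuracy_across_prompts(acc):
    """acc maps prompt format -> list of accuracies, one per correct-answer position.

    Returns (worst-case accuracy, format that attains it, per-format averages).
    """
    per_prompt = {fmt: mean(list(by_pos)) for fmt, by_pos in acc.items()}
    worst = min(per_prompt, key=per_prompt.get)
    return per_prompt[worst], worst, per_prompt


# Illustrative numbers only (hypothetical, not taken from the paper).
example = {
    "A/B/C/D": [1.0, 0.75, 0.75, 0.5],
    "Q/Z/R/X": [0.5, 0.5, 0.75, 0.25],
    "1/2/3/4": [1.0, 1.0, 0.75, 0.75],
}

score, worst_format, per_prompt = min_accuracy_across_prompts(example)
print(score, worst_format)  # 0.5 Q/Z/R/X
```

Taking the minimum rather than the mean means a model scores well only if it handles every symbol set, so a single weak format dominates the reported number.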