Quantifying Logical Consistency in Transformers via Query-Key Alignment

Eduard Tulchinskii, Anastasia Voznyuk, Laida Kushnareva, Andrei Andriiainen, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov

TL;DR

This work introduces a Query–Key (QK) score that leverages internal transformer head interactions to evaluate logical consistency in multi-step reasoning, addressing gaps in coherence assessment left by chain-of-thought prompts. By computing $S^{(l,h)}_{QK}=\mathbf{q}_{a_i}^{(l,h)\top}\mathbf{k}_{s}^{(l,h)}$ across all heads in a single forward pass, the method identifies heads that reliably distinguish valid from invalid inferences. Experiments on ProntoQA-OOD, PARARULE Plus, and Extended-Multi-LogiEval show that selected QK-score heads can outperform the model’s final probabilities, with robust handling of distractors and varying depth, and some cross-domain generalization. The approach offers a scalable, interpretable alternative to ablation studies, providing a window into how internal reasoning signals align with logical validity and suggesting avenues to augment chain-of-thought prompting in practice.
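The per-head score in the formula above is a plain dot product, so it can be illustrated with a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the random arrays stand in for query and key vectors that would be read out of a real model's attention layers, the names `qk_score`, `q_answer`, `k_true`, and `k_false` are hypothetical, and comparing a "true" score against a "false" score per head is an assumption based on the Figure 1 caption.

```python
import numpy as np

def qk_score(q, k):
    """Dot product q^T k for every (layer, head) pair.

    q, k: arrays of shape (num_layers, num_heads, head_dim).
    Returns an array of shape (num_layers, num_heads).
    """
    return np.einsum("lhd,lhd->lh", q, k)

# Stand-in vectors (assumption): in practice these would be extracted
# from the model's attention heads during one forward pass.
rng = np.random.default_rng(0)
num_layers, num_heads, head_dim = 2, 3, 4
q_answer = rng.normal(size=(num_layers, num_heads, head_dim))
k_true = rng.normal(size=(num_layers, num_heads, head_dim))
k_false = rng.normal(size=(num_layers, num_heads, head_dim))

s_true = qk_score(q_answer, k_true)    # score against the "true" token
s_false = qk_score(q_answer, k_false)  # score against the "false" token

# Each head casts its own vote: True where "true" scores higher.
head_votes = s_true > s_false
print(head_votes.shape)  # (2, 3)
```

Because all heads are scored from the same set of cached vectors, the whole table of scores comes from one forward pass, which is what makes the approach cheap compared with ablating heads one at a time.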

Abstract

Large language models (LLMs) have demonstrated impressive performance in various natural language processing tasks, yet their ability to perform multi-step logical reasoning remains an open challenge. Although Chain-of-Thought prompting has improved logical reasoning by enabling models to generate intermediate steps, it lacks mechanisms to assess the coherence of these logical transitions. In this paper, we propose a novel, lightweight evaluation strategy for logical reasoning that uses query-key alignments inside transformer attention heads. By running a single forward pass and extracting a "QK-score" from carefully chosen heads, our method reveals latent representations that reliably separate valid from invalid inferences, offering a scalable alternative to traditional ablation-based techniques. We also validate the method empirically on multiple logical reasoning benchmarks, demonstrating that it is more robust to distractors and to increased reasoning depth. The experiments were conducted on a diverse set of models, ranging from 1.5B to 70B parameters.
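One way to read "carefully chosen heads" is as a selection step over a small labeled calibration set. The sketch below assumes, hypothetically, that each head predicts "valid" when its QK-score for the "true" token exceeds the one for the "false" token, and that the head with the highest calibration accuracy is kept; the selection criterion, the function name `select_best_head`, and the toy data are illustrative assumptions rather than details confirmed by the abstract.

```python
import numpy as np

def select_best_head(score_true, score_false, labels):
    """Pick the (layer, head) whose votes best match the labels.

    score_true, score_false: (num_examples, num_layers, num_heads)
        QK-scores for the "true" and "false" tokens.
    labels: (num_examples,) boolean array, True for valid inferences.
    Returns (layer, head, calibration_accuracy).
    """
    preds = score_true > score_false                      # (N, L, H)
    acc = (preds == labels[:, None, None]).mean(axis=0)   # (L, H)
    layer, head = np.unravel_index(acc.argmax(), acc.shape)
    return int(layer), int(head), float(acc[layer, head])

# Toy calibration set (assumption): 8 examples, 2 layers, 3 heads.
labels = np.array([True, False] * 4)
score_true = np.zeros((8, 2, 3))
score_false = np.zeros((8, 2, 3))
# Make head (layer 1, head 2) informative; all others always vote False.
score_true[:, 1, 2] = np.where(labels, 1.0, -1.0)

best = select_best_head(score_true, score_false, labels)
print(best)  # (1, 2, 1.0)
```

At test time only the selected head's scores would be consulted, so the per-example cost stays at one forward pass.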

Paper Structure

This paper contains 17 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Our method calculates the Query-Key score between the end-of-line token immediately after the statement and the "true"/"false" tokens, for the designated head, from which we derive the answer.
  • Figure 2: PrOntoQA-OOD Example
  • Figure 3: In-domain performance on the ProntoQA-OOD dataset. The best head was selected on calibration data for each case individually.
  • Figure 4: PARARULE-PLUS prompt example of reasoning, depth 2
  • Figure 5: Multi-LogiEval Modus Ponens Example of reasoning depth 1