Attention Consistency for LLMs Explanation
Tian Lan, Jinyuan Xu, Xue He, Jenq-Neng Hwang, Lei Li
TL;DR
MACS is a lightweight, inference-time heuristic for token attribution in decoder-only Transformer models that measures the cross-layer consistency of the strongest input-attention links. Unlike full-aggregation methods, MACS applies layer-wise max-pooling, a floor bias, and multiplicative accumulation across layers, followed by z-score normalization, to yield clear, sparse attributions with real-time efficiency. Empirical results on QA (a SQuAD 2.0 subset) show that MACS ranks ground-truth answer tokens higher (AUC-PR) and achieves competitive faithfulness (SRG) relative to more complex baselines, while requiring far less VRAM and preserving throughput. A preliminary VQA study suggests that MACS can extend to multimodal Transformers by analyzing attention in cross-modal layers, highlighting its potential as a general, efficient diagnostic tool for interpretability across Transformer architectures.
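The four steps named above (layer-wise max-pooling, floor bias, multiplicative accumulation, z-score normalization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the tensor layout, the pooling axes (heads and query positions), the exact form of the floor bias, and the names `macs_sketch` and `floor` are all assumptions made for illustration, and the toy attention values are not real softmax outputs.

```python
import numpy as np

def macs_sketch(attn, floor=0.1, eps=1e-8):
    """Hypothetical MACS-style score.

    attn: array of shape (layers, heads, queries, input_tokens) holding
    attention weights from generated positions to input tokens (assumed layout).
    """
    attn = np.asarray(attn, dtype=np.float64)
    # 1) Layer-wise max-pooling: keep the strongest link to each input token
    #    in every layer (pooling over heads and queries is an assumption).
    pooled = attn.max(axis=(1, 2))                # (layers, input_tokens)
    # 2) Floor bias (assumed form): mix in a constant so that a single
    #    weak layer cannot drive the product to zero.
    biased = floor + (1.0 - floor) * pooled
    # 3) Multiplicative accumulation across layers, computed in log space
    #    for numerical stability.
    raw = np.exp(np.log(biased).sum(axis=0))      # (input_tokens,)
    # 4) z-score normalization across input tokens.
    return (raw - raw.mean()) / (raw.std() + eps)

# Toy example: 3 layers, 2 heads, 2 query positions, 4 input tokens.
rng = np.random.default_rng(0)
attn = rng.uniform(0.0, 0.2, size=(3, 2, 2, 4))
attn[:, 0, 0, 2] = 0.9    # token 2 is attended strongly in every layer
attn[0, 0, 0, 1] = 0.95   # token 1 is attended strongly in one layer only
scores = macs_sketch(attn)
```

Under these assumptions, the token that receives strong attention consistently across layers gets the highest score, while a token with a single strong link in one layer is penalized by the product over the remaining layers.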
Abstract
Understanding the decision-making processes of large language models (LLMs) is essential for their trustworthy development and deployment. However, current interpretability methods often face challenges such as low resolution and high computational cost. To address these limitations, we propose the Multi-Layer Attention Consistency Score (MACS), a novel, lightweight, and easily deployable heuristic for estimating the importance of input tokens in decoder-based models. MACS measures contributions of input tokens based on the consistency of maximal attention. Empirical evaluations demonstrate that MACS achieves a favorable trade-off between interpretability quality and computational efficiency, showing faithfulness comparable to complex techniques with a 22% decrease in VRAM usage and 30% reduction in latency.
