Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow
Behrooz Azarkhalili, Maxwell Libbrecht
TL;DR
This paper addresses the problem of extracting reliable feature attributions from Transformer models by replacing raw attention weights with a principled network-flow formulation. It introduces Generalized Attention Flow (GAF), which builds an Information Tensor from attention weights and their gradients and solves a barrier-regularized maximum flow problem to obtain unique attributions. The authors prove that these barrier-regularized attributions satisfy the Shapley axioms, and their benchmarks show that a variant of GAF often outperforms existing attribution methods on sequence classification tasks. An open-source Python package computes these attributions for encoder-only Transformer models, which supports broader application and evaluation in NLP settings. Overall, the approach offers a theoretically grounded and practically effective framework for interpreting Transformer decisions.
Abstract
This paper introduces Generalized Attention Flow (GAF), a novel feature attribution method for Transformer-based models that addresses the limitations of current approaches. GAF extends Attention Flow by replacing attention weights with a generalized Information Tensor, combining attention weights, their gradients, the maximum flow problem, and the barrier method to improve the quality of feature attributions. The proposed method exhibits key theoretical properties and mitigates the shortcomings of prior techniques that rely solely on simple aggregation of attention weights. Our comprehensive benchmarking on sequence classification tasks demonstrates that a specific variant of GAF outperforms state-of-the-art feature attribution methods in most evaluation settings, providing a more reliable interpretation of Transformer model outputs.
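To make the flow-based idea concrete, the following is a minimal, standard-library sketch of attributing tokens through a maximum flow on a layered token graph. It is not the authors' package or its API. Several choices here are assumptions made purely for illustration: the function names `information_tensor`, `max_flow`, and `flow_attributions` are hypothetical; the Information Tensor is taken to be the head-averaged, zero-clipped elementwise product of attention and its gradient; the graph uses a super-source feeding the input tokens and a super-sink collecting the last layer; and attributions are read from the flow leaving the super-source. The barrier regularization described above is not implemented, so when several optimal flows exist this plain Edmonds-Karp solver returns just one of them.

```python
from collections import deque


def information_tensor(attn, grad):
    """Assumed variant: mean over heads of max(attention * gradient, 0).

    attn, grad: nested lists indexed [layer][head][query][key].
    Returns nested lists indexed [layer][query][key].
    """
    tensor = []
    for a_l, g_l in zip(attn, grad):
        heads, n = len(a_l), len(a_l[0])
        tensor.append([
            [sum(max(a_l[h][i][j] * g_l[h][i][j], 0.0) for h in range(heads)) / heads
             for j in range(n)]
            for i in range(n)
        ])
    return tensor


def max_flow(cap, s, t):
    """Edmonds-Karp on a dict-of-dicts capacity map without antiparallel edges."""
    res = {u: dict(vs) for u, vs in cap.items()}  # residual capacities
    for u, vs in cap.items():
        for v in vs:
            res.setdefault(v, {}).setdefault(u, 0.0)  # reverse residual edges
    total = 0.0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            break
        # Find the bottleneck along the path, then push that much flow.
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        total += bottleneck
    flow = {(u, v): cap[u][v] - res[u][v] for u in cap for v in cap[u]}
    return total, flow


def flow_attributions(tensor):
    """Build the layered graph and return (max flow value, per-token outflow)."""
    n_layers, n = len(tensor), len(tensor[0])
    # Finite stand-in for "unbounded" source/sink capacities.
    big = 1.0 + sum(c for layer in tensor for row in layer for c in row)
    cap = {"s": {(0, j): big for j in range(n)}}
    for l, layer in enumerate(tensor):
        for i in range(n):      # query token at layer l + 1
            for j in range(n):  # key token at layer l
                cap.setdefault((l, j), {})[(l + 1, i)] = layer[i][j]
    for i in range(n):
        cap[(n_layers, i)] = {"t": big}
    total, flow = max_flow(cap, "s", "t")
    return total, [flow[("s", (0, j))] for j in range(n)]


# Toy input: one layer, one head, two tokens (values are made up).
attn = [[[[0.9, 0.1], [0.5, 0.5]]]]
grad = [[[[1.0, -1.0], [0.5, 0.5]]]]
total, attributions = flow_attributions(information_tensor(attn, grad))
print(total, attributions)
```

In this toy case the negative gradient zeroes out one edge, so the first token carries most of the flow. With more layers, a plain max-flow solver can distribute flow among tokens in more than one optimal way, which is the non-uniqueness that the barrier regularization is meant to resolve.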
