ASCENT-ViT: Attention-based Scale-aware Concept Learning Framework for Enhanced Alignment in Vision Transformers
Sanchit Sinha, Guangzhi Xiong, Aidong Zhang
TL;DR
ASCENT-ViT addresses the challenge of interpretable alignment between human concepts and Vision Transformer representations by learning scale- and patch-aware features and aligning them with concepts through attention. It integrates three modules, Multi-scale Encoding (MSE), Deformable Multi-Scale Fusion (DMSF), and the Concept-Representation Alignment Module (CRAM), to produce explanations that reflect both spatial and global concepts while improving classification. Across five datasets (CUB, AWA2, KITS, Pascal aPY, and Concept-MNIST), ASCENT-ViT achieves higher task accuracy and more accurate concept localization than model-agnostic baselines. The approach is robust to input transformations and adds explainability with only a small parameter overhead, making it practical for real-world deployment.
Abstract
As Vision Transformers (ViTs) are increasingly adopted in sensitive vision applications, there is a growing demand for improved interpretability. This has led to efforts to forward-align these models with carefully annotated, abstract, human-understandable semantic entities called concepts. Concepts provide global rationales for model predictions and can be quickly understood and intervened on by domain experts. Most current research focuses on designing model-agnostic, plug-and-play concept-based explainability modules that do not incorporate the inner workings of foundation models (e.g., inductive biases, scale invariance) during training. To alleviate this issue for ViTs, we propose ASCENT-ViT, an attention-based concept learning framework that composes scale-aware and position-aware representations from multiscale feature pyramids and ViT patch representations, respectively. These representations are then aligned with concept annotations through attention matrices that incorporate both spatial and global (semantic) concepts. ASCENT-ViT can be used as a classification head on top of standard ViT backbones, yielding improved predictive performance along with accurate and robust concept explanations, as demonstrated on five datasets: three widely used benchmarks (CUB, Pascal aPY, Concept-MNIST) and two real-world datasets (AWA2, KITS).
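
To make the idea of aligning patch representations with concepts "through attention matrices" more concrete, the following is a minimal illustrative sketch of generic scaled dot-product cross-attention between concept queries and patch tokens. It is not the ASCENT-ViT implementation: the function name `concept_attention`, the toy 2-dimensional features, and the choice to score a concept by its dot product with the attention-pooled patch feature are all assumptions made for illustration, and the multi-scale encoding and deformable fusion steps are omitted entirely.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def concept_attention(concept_queries, patch_tokens):
    """Cross-attend hypothetical concept queries over patch tokens.

    Returns (attention, scores), where attention[k][n] is the weight that
    concept k places on patch n (a spatial map once reshaped to the patch
    grid) and scores[k] is a scalar activation for concept k.
    """
    d = len(patch_tokens[0])
    scale = 1.0 / math.sqrt(d)  # standard scaled dot-product factor
    attention, scores = [], []
    for q in concept_queries:
        logits = [dot(q, p) * scale for p in patch_tokens]
        weights = softmax(logits)
        # Attention-weighted average of patch features for this concept.
        pooled = [
            sum(w * p[j] for w, p in zip(weights, patch_tokens))
            for j in range(d)
        ]
        attention.append(weights)
        # Assumed scoring rule: similarity between query and pooled feature.
        scores.append(dot(q, pooled))
    return attention, scores


# Toy input: a 2x2 patch grid flattened to 4 tokens with 2-d features.
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
# Two made-up concept queries, each matching one feature direction.
concept_queries = [[4.0, 0.0], [0.0, 4.0]]

attention, scores = concept_attention(concept_queries, patch_tokens)
```

In this toy setup each row of `attention` sums to one, and the first concept concentrates its weight on patches 0 and 3 while the second concentrates on patches 1 and 2. In a trained model, such per-concept maps could be compared against spatial concept annotations, and the scalar scores against global concept labels.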
