Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training

Kristoffer Wickstrøm; Teresa Dorszewski; Siyan Chen; Michael Kampffmeyer; Elisabeth Wetzer; Robert Jenssen

TL;DR

This work tackles the explainability gap in Vision Transformer (ViT) foundation models by introducing Keypoint Counting Classifiers (KCCs), a training-free approach that turns pretrained ViTs into self-explainable models (SEMs). KCCs identify keypoints in an image, match them to prototype keypoints via mutual nearest neighbors, and classify by counting the matches; explanations are shown as keypoints in the input image rather than as bounding boxes or heatmaps. The authors validate KCCs through quantitative benchmarks and a user study, reporting improved explanation quality and user understanding, and demonstrate that vision-language labeling (VLParts) can automatically describe keypoints to reduce reader bias. The work highlights the flexibility and practicality of training-free SEMs for ViTs and points to future improvements in keypoint weighting and bias reduction. Overall, KCCs are a step toward transparent, reliable ViT-based models suitable for safety-critical applications.
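The match-and-count decision rule summarized above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 2-D descriptors, the class labels, and the use of cosine similarity are all assumptions made for illustration. In practice the descriptors would be patch features from a pretrained ViT, and keypoint selection is skipped here.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two descriptors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mutual_nearest_neighbors(
    query: List[Vector], proto: List[Vector]
) -> List[Tuple[int, int]]:
    """Return (query_idx, proto_idx) pairs that are each other's nearest neighbor."""
    if not query or not proto:
        return []
    sim = [[cosine(q, p) for p in proto] for q in query]
    # Nearest prototype keypoint for every query keypoint.
    best_proto = [max(range(len(proto)), key=lambda j: row[j]) for row in sim]
    # Nearest query keypoint for every prototype keypoint.
    best_query = [
        max(range(len(query)), key=lambda i: sim[i][j]) for j in range(len(proto))
    ]
    # Keep only the pairs where the choice is mutual.
    return [(i, j) for i, j in enumerate(best_proto) if best_query[j] == i]


def classify_by_counting(
    query: List[Vector], prototypes: Dict[str, List[Vector]]
) -> Tuple[str, Dict[str, int]]:
    """Predict the class whose prototype keypoints yield the most mutual matches."""
    counts = {
        label: len(mutual_nearest_neighbors(query, protos))
        for label, protos in prototypes.items()
    }
    return max(counts, key=counts.get), counts


# Hypothetical toy data: three query keypoints and two classes of prototypes.
query = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
prototypes = {
    "bird": [(0.9, 0.1), (0.1, 0.9), (0.7, 0.7)],
    "dog": [(-1.0, 0.0), (0.0, -1.0)],
}
predicted, counts = classify_by_counting(query, prototypes)
print(predicted, counts)
```

Because the prediction is just a comparison of match counts, the matched index pairs themselves serve as the explanation: each one can be drawn as a keypoint in the input image and in the prototype image. Note that plain mutual nearest neighbors can also pair dissimilar descriptors when nothing better is available, so a real system may need additional filtering.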

Abstract

Current approaches for designing self-explainable models (SEMs) require complicated training procedures and specific architectures, which makes them impractical. With the advance of general-purpose foundation models based on Vision Transformers (ViTs), this impracticality becomes even more problematic. Therefore, new methods are necessary to provide transparency and reliability to ViT-based foundation models. In this work, we present a new method for turning any well-trained ViT-based model into an SEM without retraining, which we call Keypoint Counting Classifiers (KCCs). Recent works have shown that ViTs can automatically identify matching keypoints between images with high precision, and we build on these results to create an easily interpretable decision process that is inherently visualizable in the input. We perform an extensive evaluation which shows that KCCs improve human-machine communication compared to recent baselines. We believe that KCCs constitute an important step towards making ViT-based foundation models more transparent and reliable.
