This Looks Better than That: Better Interpretable Models with ProtoPNeXt

Frank Willard, Luke Moffett, Emmanuel Mokel, Jon Donnelly, Stark Guo, Julia Yang, Giyoung Kim, Alina Jade Barnett, Cynthia Rudin

TL;DR

The paper addresses the difficulty of training and deploying interpretable prototypical-part networks by introducing ProtoPNeXt, a unified framework for combining and tuning components of prototypical-part models. It shows that applying cosine (angular) similarity for prototype comparison and Bayesian hyperparameter optimization to the original ProtoPNet is sufficient to reach state-of-the-art accuracy for prototypical-part models on CUB-200 across multiple backbones, which calls into question how much of the reported gains come from newer methods alone. The authors further show that jointly optimizing for accuracy and prototype interpretability yields prototypes with substantially better semantics, with accuracy changing by between +1.3% and -1.5%. These findings suggest that careful tuning and attention to prototype quality matter for both performance and interpretability, which may ease application to new datasets. The authors plan to release code and trained models upon publication.

Abstract

Prototypical-part models are a popular interpretable alternative to black-box deep learning models for computer vision. However, they are difficult to train, with high sensitivity to hyperparameter tuning, inhibiting their application to new datasets and our understanding of which methods truly improve their performance. To facilitate the careful study of prototypical-part networks (ProtoPNets), we create a new framework for integrating components of prototypical-part models -- ProtoPNeXt. Using ProtoPNeXt, we show that applying Bayesian hyperparameter tuning and an angular prototype similarity metric to the original ProtoPNet is sufficient to produce new state-of-the-art accuracy for prototypical-part models on CUB-200 across multiple backbones. We further deploy this framework to jointly optimize for accuracy and prototype interpretability as measured by metrics included in ProtoPNeXt. Using the same resources, this produces models with substantially superior semantics and changes in accuracy between +1.3% and -1.5%. The code and trained models will be made publicly available upon publication.
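To make the "angular prototype similarity metric" concrete, the following minimal sketch contrasts cosine similarity with squared Euclidean distance when comparing a prototype vector to latent image patches. It is an illustration under stated assumptions, not the ProtoPNeXt implementation: the function names (`cosine_similarity`, `squared_euclidean`, `prototype_activation`), the max-over-patches pooling, and the toy vectors are all hypothetical choices made here for clarity.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between u and v; depends on direction, not magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors (assumed safeguard).
    return dot / max(norm_u * norm_v, eps)


def squared_euclidean(u, v):
    """Squared L2 distance; sensitive to the magnitude of the vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def prototype_activation(patches, prototype, similarity):
    """Highest similarity between the prototype and any patch (assumed max pooling)."""
    return max(similarity(patch, prototype) for patch in patches)


# Toy example: the first patch points in the same direction as the prototype
# but has ten times its magnitude; the second patch is orthogonal to it.
prototype = [1.0, 0.0]
patches = [[10.0, 0.0], [0.0, 1.0]]

best_cosine = prototype_activation(patches, prototype, cosine_similarity)
distance_to_aligned_patch = squared_euclidean(patches[0], prototype)
```

In this toy case the aligned patch receives the maximum cosine similarity even though its Euclidean distance to the prototype is large, which shows how an angular metric ignores feature magnitude. Whether this property explains the trainability differences reported in the paper is not something this sketch establishes.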

Paper Structure (19 sections, 1 equation, 13 figures, 6 tables)

Figures (13)

  • Figure 1: Eight randomly selected prototypes from a model produced with ProtoPNeXt. All prototypes from the model demonstrate strong semantics. The model has 86.2% accuracy on uncropped CUB-200, and the full set of 253 prototypes from this model can be seen here: https://drive.google.com/drive/folders/174f2PPhLRarLevOhjm3Vc_YDt8x11mQ-.
  • Figure 2: Comparison of Trainability of Euclidean vs. Cosine Similarity Models. The distribution of observed validation accuracy across all runs in 12 computational days of hyperparameter tuning. Using cosine similarity, we found a substantially higher optimal accuracy, a larger number of trained models, and a larger proportion of trained models achieving high accuracy.
  • Figure 3: Accuracy Progression by GPU-Hours. GPU-hours are calculated as the product of the number of GPUs used (4) and the number of hours of training. The two cosine similarity models, 'ProtoPNet with cosine' and 'deformable', start with better performance and achieve saturation faster than 'ProtoPNet with Euclidean distance'. 'ProtoPNet with cosine distance' achieves saturation in under 50 GPU-hours on DenseNet-121 and VGG-16, and under 100 GPU-hours on other backbones.
  • Figure 4: Comparing global analysis of best joint- and accuracy-only-optimized models. The leftmost image in each collection is a prototype, followed by the five images with the highest activations for that prototype. Models were selected for best validation accuracy across all configurations. Prototypes from the jointly optimized model are more precise and consistent. Joint model: 86.2% test accuracy, 81.4 test prototype score; Accuracy only: 86.4% test accuracy, 67.6 prototype score.
  • Figure 5: Comparing Generalization of Rigid Prototypes to Deformable Prototypes. Accuracy Difference is the difference between validation accuracy and test accuracy. Deformable prototypes generalize marginally better than rigid prototypes when using cosine similarity.
  • ...and 8 more figures