ViTree: Single-path Neural Tree for Step-wise Interpretable Fine-grained Visual Categorization

Danning Lao, Qi Liu, Jiazi Bu, Junchi Yan, Wei Shen

TL;DR

Fine-grained visual categorization models often struggle with interpretability; this paper proposes ViTree, which fuses a vision transformer backbone with a hard-path neural tree to enable step-wise, patch-based reasoning. ViTree uses hard patches selected at each node and a single-leaf path to refine representations and produce predictions, achieving strong performance while offering intrinsic interpretability. The authors demonstrate state-of-the-art results on CUB-200-2011 and Stanford Cars and validate interpretability through algorithmic transparency analysis, case studies, and human-centered surveys. The work advances practical FGVC by delivering interpretable, accurate models and providing insight into model decisions via human-understandable patches and paths.

Abstract

As computer vision continues to advance and finds widespread applications across various domains, the need for interpretability in deep learning models becomes paramount. Existing methods often resort to post-hoc techniques or prototypes to explain the decision-making process, which can be indirect and lack intrinsic illustration. In this research, we introduce ViTree, a novel approach for fine-grained visual categorization that combines the popular vision transformer as a feature extraction backbone with neural decision trees. By traversing the tree paths, ViTree selects patches from transformer-processed features to highlight informative local regions, thereby refining representations in a step-wise manner. Unlike previous tree-based models that rely on soft distributions or ensembles of paths, ViTree selects a single tree path, offering a clearer and simpler decision-making process. This patch and path selectivity enhances the interpretability of ViTree, enabling better insights into the model's inner workings. Extensive experimentation validates that this streamlined approach surpasses various strong competitors and achieves state-of-the-art performance while maintaining strong interpretability, which is supported by evaluation from multiple perspectives. Code can be found at https://github.com/SJTU-DeepVisionLab/ViTree.
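
The single-path idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the repository linked above for that); it only shows what "pick one patch per node, then follow one branch" might look like. Everything here is an assumption made for illustration: the function name `single_path_traverse`, the per-node query and router vectors, the dot-product scoring, the averaging used to refine the representation, and the heap-order layout of a complete binary tree.

```python
import numpy as np


def single_path_traverse(patch_feats, node_queries, router_weights, depth):
    """Toy hard-path traversal of a complete binary tree stored in heap order.

    patch_feats:    (num_patches, dim) features, e.g. from a ViT backbone.
    node_queries:   (2**depth - 1, dim) hypothetical per-node patch selectors.
    router_weights: (2**depth - 1, dim) hypothetical per-node routing vectors.
    Returns the visited node indices, the selected patch indices, and the
    final representation reached at the leaf.
    """
    rep = patch_feats.mean(axis=0)  # assumed starting representation
    node = 0                        # root of the tree
    path, picked = [node], []
    for _ in range(depth):
        # Hard patch selection: keep only the single best-scoring patch.
        scores = patch_feats @ node_queries[node]
        idx = int(np.argmax(scores))
        picked.append(idx)
        # Step-wise refinement (assumed rule: average with the chosen patch).
        rep = 0.5 * (rep + patch_feats[idx])
        # Hard routing: exactly one child is followed, never a soft mixture.
        go_right = float(router_weights[node] @ rep) > 0.0
        node = 2 * node + 1 + int(go_right)
        path.append(node)
    return path, picked, rep


# Example usage with random stand-in features.
rng = np.random.default_rng(0)
depth, dim, num_patches = 3, 8, 16
feats = rng.normal(size=(num_patches, dim))
queries = rng.normal(size=(2**depth - 1, dim))
routers = rng.normal(size=(2**depth - 1, dim))
path, picked, rep = single_path_traverse(feats, queries, routers, depth)
print("visited nodes:", path)
print("selected patches:", picked)
```

Because each step commits to one patch and one child, the returned `path` and `picked` lists are the whole explanation of a prediction in this sketch, which is the property the abstract contrasts with soft or ensemble routing.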

Paper Structure

This paper contains 28 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: A comparative analysis of human and model focus in avian classification. Left: Visualization of human attention distribution during bird classification. Right: Decision path of ViTree with highlighted patches, reflecting localized regions of interest. Regions where human focus and model patches coincide are marked with a checkmark, demonstrating the consistency of the cognitive approach between human and model.
  • Figure 2: Illustration of the proposed ViTree pipeline. Purple: The vision transformer module, which takes raw images as input and outputs the primary extracted features. Yellow: The neural tree module. The left part is a sketch of a tree, and the right part is an example of the parent-to-child representation learning process.
  • Figure 3: Confusion matrix on CUB-200-2011.
  • Figure 4: Effect of tree depth on CUB-200-2011.
  • Figure 5: ViTree's proficiency in capturing key classifying attributes among bird species in accordance with human judgment. The upper part illustrates the model's patch selections along the decision path. The selected patches are highlighted in red, labeled with their order on the image, and listed to the right of the figure. The lower part summarizes ChatGPT's insights on distinctive human-observable traits for each species. Notably, these traits align closely with our model's focal points, underscoring a robust harmony between our model's internal logic and human perspectives.