Visual Recognition with Deep Nearest Centroids

Wenguan Wang, Cheng Han, Tianfei Zhou, Dongfang Liu

TL;DR

The paper proposes Deep Nearest Centroids (DNC), a nonparametric, centroid-based classifier for visual recognition that replaces the parametric softmax with class sub-centroids discovered via Sinkhorn clustering and updated through momentum on external memory. DNC jointly learns embeddings and meaningful within-class structure, enabling transfer from ImageNet representations to tasks like segmentation, while providing ad-hoc, exemplar-based explanations. Across image classification and semantic segmentation, DNC demonstrates consistent performance gains with far fewer learnable parameters and offers interpretable decision evidence. The approach opens avenues for stronger transferability, explainability, and integration with metric-learning frameworks in large-scale vision systems.
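The two mechanisms named above, Sinkhorn-style clustering and a momentum update of sub-centroids held in memory, can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function names, the temperature `eps`, the iteration count, and the momentum value `mu` are all assumptions chosen for the example, and the similarity scores are assumed to be cosine-like values in [-1, 1].

```python
import math

def sinkhorn_assign(scores, eps=0.05, n_iters=3):
    """Assign n samples to k sub-centroids of one class (hypothetical helper).

    scores[i][j] is the similarity of sample i to sub-centroid j.
    Alternating column/row normalisation pushes the soft assignment
    towards equal usage of every sub-centroid.
    """
    n, k = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract max for stability
    q = [[math.exp((s - top) / eps) for s in row] for row in scores]
    for _ in range(n_iters):
        # Columns: every sub-centroid receives total mass 1/k.
        for j in range(k):
            col = sum(q[i][j] for i in range(n))
            for i in range(n):
                q[i][j] /= col * k
        # Rows: every sample carries total mass 1/n.
        for i in range(n):
            row = sum(q[i])
            for j in range(k):
                q[i][j] /= row * n
    # Hard assignment: most likely sub-centroid per sample.
    return [max(range(k), key=lambda j: q[i][j]) for i in range(n)]

def momentum_update(centroid, assigned_embeddings, mu=0.999):
    """Return mu * centroid + (1 - mu) * mean(assigned_embeddings).

    mu=0.999 is an assumed default, not a value taken from this page.
    """
    if not assigned_embeddings:
        return list(centroid)  # nothing assigned in this batch: keep as-is
    n = len(assigned_embeddings)
    dim = len(centroid)
    mean = [sum(e[d] for e in assigned_embeddings) / n for d in range(dim)]
    return [mu * c + (1.0 - mu) * m for c, m in zip(centroid, mean)]
```

In this sketch the sub-centroids are never touched by gradient descent; they only drift towards the mean of the embeddings assigned to them, which is one way to read "updated through momentum on external memory".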

Abstract

We devise deep nearest centroids (DNC), a conceptually elegant yet surprisingly effective network for large-scale visual recognition, by revisiting Nearest Centroids, one of the most classic and simple classifiers. Current deep models learn the classifier in a fully parametric manner, ignoring the latent data structure and lacking simplicity and explainability. DNC instead conducts nonparametric, case-based reasoning; it utilizes sub-centroids of training samples to describe class distributions and clearly explains the classification as the proximity of test data and the class sub-centroids in the feature space. Due to the distance-based nature, the network output dimensionality is flexible, and all the learnable parameters are only for data embedding. That means all the knowledge learnt for ImageNet classification can be completely transferred for pixel recognition learning, under the "pre-training and fine-tuning" paradigm. Apart from its nested simplicity and intuitive decision-making mechanism, DNC can even possess ad-hoc explainability when the sub-centroids are selected as actual training images that humans can view and inspect. Compared with parametric counterparts, DNC performs better on image classification (CIFAR-10, ImageNet) and greatly boosts pixel recognition (ADE20K, Cityscapes), with improved transparency and fewer learnable parameters, using various network architectures (ResNet, Swin) and segmentation models (FCN, DeepLabV3, Swin). We feel this work brings fundamental insights into related fields.
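The decision rule described in the abstract, assigning a test sample to the class of its closest sub-centroid in feature space, can be illustrated with a small sketch. It assumes cosine similarity as the proximity measure and non-zero vectors; the `classify` helper, the `memory` layout, and the toy 2-D vectors are hypothetical and only stand in for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(embedding, sub_centroids):
    """Nearest sub-centroid rule (illustrative, not the authors' code).

    sub_centroids maps a class label to a list of sub-centroid vectors.
    Returns (label, index of the winning sub-centroid, similarity).
    """
    best = None
    for label, centroids in sub_centroids.items():
        for idx, centroid in enumerate(centroids):
            sim = cosine(embedding, centroid)
            if best is None or sim > best[2]:
                best = (label, idx, sim)
    return best

# Toy memory: two classes with two sub-centroids each (made-up values).
memory = {
    "cat": [[1.0, 0.0], [0.8, 0.6]],
    "dog": [[0.0, 1.0], [-0.6, 0.8]],
}
label, idx, sim = classify([0.6, 0.8], memory)
print(label, idx, round(sim, 2))  # cat 1 0.96
```

The returned index identifies which sub-centroid served as evidence for the prediction, and the classifier itself holds no learnable weights: in this sketch, supporting a different label set only means swapping the contents of `memory`.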

Paper Structure (24 sections, 10 equations, 7 figures, 25 tables, 1 algorithm)


Figures (7)

  • Figure 1: (b) Prevalent visual recognition models, built upon parametric softmax classifiers, have a few limitations, such as their non-transparent decision-making process. (c) Humans can use past cases as models when solving new problems [newell1972human, aamodt1994case] (e.g., comparing a new animal with a few familiar/exemplar animals for categorization). (d) DNC makes classification based on the similarity of a test sample to class sub-centroids (representative training examples) in the feature space. The class sub-centroids are vital for capturing underlying data structure, enhancing interpretability, and boosting recognition.
  • Figure 2: With a distance-/case-based classification scheme, DNC combines unsupervised sub-pattern discovery and supervised representation learning in a synergy.
  • Figure 3: DNC can provide (dis)similarity-based interpretation. For the two test samples, we only plot the normalized similarities for their corresponding closest sub-centroids from top-4 scoring classes.
  • Figure 4: Sub-centroid images for eight randomly chosen classes from ImageNet. See the appendix section labeled sec_app:A8 for more details.
  • Figure 5: More examples of DNC interpreting its predictions based on its computed similarity to class sub-centroid images. For each test image, we plot the normalized similarities for the corresponding closest sub-centroids from the top-4 scoring classes. See the appendix section labeled sec_app:A8 for more details.
  • ...and 2 more figures