Neural Clustering based Visual Representation Learning

Guikun Chen, Xia Li, Yi Yang, Wenguan Wang

TL;DR

This work reframes visual feature extraction as neural clustering, proposing feature extraction with clustering (FEC). By iteratively pooling and encoding features around adaptively initialized cluster centers, FEC builds a hierarchy of cluster representatives that communicate directly with pixel features, yielding a transparent forward process and ad-hoc interpretability. The approach delivers competitive performance on ImageNet classification, transfers to semantic segmentation and object detection, and exhibits emergent segmentation without segmentation supervision. One limitation is its reliance on a fixed number of clusters; future directions include nonparametric clustering and tighter integration with set-prediction paradigms to improve flexibility and scalability.

Abstract

We investigate a fundamental aspect of machine vision: the measurement of features, by revisiting clustering, one of the most classic approaches in machine learning and data analysis. Existing visual feature extractors, including ConvNets, ViTs, and MLPs, represent an image as rectangular regions. Though prevalent, such a grid-style paradigm is built upon engineering practice and lacks explicit modeling of data distribution. In this work, we propose feature extraction with clustering (FEC), a conceptually elegant yet surprisingly ad-hoc interpretable neural clustering framework, which views feature extraction as a process of selecting representatives from data and thus automatically captures the underlying data distribution. Given an image, FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives. Such an iterative working mechanism is implemented in the form of several neural layers and the final representatives can be used for downstream tasks. The cluster assignments across layers, which can be viewed and inspected by humans, make the forward process of FEC fully transparent and empower it with promising ad-hoc interpretability. Extensive experiments on various visual recognition models and tasks verify the effectiveness, generality, and interpretability of FEC. We expect this work will provoke a rethink of the current de facto grid-style paradigm.
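
The alternation described in the abstract (group pixels into clusters to obtain representatives, then update pixel features from those representatives) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cosine-similarity soft assignment, the average-pooled center initialization, the residual update, the `temperature` value, and all function names are assumptions chosen for illustration, and the actual FEC layers are learned neural modules.

```python
import numpy as np


def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, eps=1e-8):
    # Scale each row to (approximately) unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def init_centers(features, grid_hw, k_side):
    # Assumption: seed centers by average-pooling the feature map
    # over a k_side x k_side grid of blocks.
    h, w = grid_hw
    d = features.shape[1]
    fmap = features.reshape(h, w, d)
    bh, bw = h // k_side, w // k_side
    blocks = fmap[: bh * k_side, : bw * k_side].reshape(k_side, bh, k_side, bw, d)
    return blocks.mean(axis=(1, 3)).reshape(k_side * k_side, d)


def cluster_step(features, centers, temperature=0.1):
    # Grouping: soft-assign every pixel to the clusters by cosine similarity.
    sim = l2_normalize(features) @ l2_normalize(centers).T  # (N, K)
    assign = softmax(sim / temperature, axis=1)             # rows sum to 1
    # Pooling: each representative is the assignment-weighted mean of pixels.
    reps = (assign.T @ features) / (assign.sum(axis=0)[:, None] + 1e-8)
    # Encoding: each pixel receives a residual update from the representatives.
    updated = features + assign @ reps
    return updated, reps, assign.argmax(axis=1)


def fec_like_forward(features, grid_hw, k_side=2, num_layers=3):
    # Hypothetical stand-in for a stack of clustering-based layers.
    centers = init_centers(features, grid_hw, k_side)
    labels = None
    for _ in range(num_layers):
        features, centers, labels = cluster_step(features, centers)
    return centers, labels.reshape(grid_hw)


# Usage on random "pixel features": an 8x8 map with 16 channels.
rng = np.random.default_rng(0)
h, w, d = 8, 8, 16
pixels = rng.normal(size=(h * w, d))
reps, label_map = fec_like_forward(pixels, (h, w))
print(reps.shape, label_map.shape)
```

The returned `label_map` plays the role of the inspectable cluster assignments mentioned in the abstract: it records, for every pixel, which representative it was grouped with in the last layer.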

Paper Structure

This paper contains 15 sections, 9 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: How to represent an image in a low-dimensional space and what could explain it? (a-c) Existing visual backbones rely on the computational modeling of rigid grids. (d) Derived from a neural clustering view, FEC reformulates the procedures of feature extraction as clustering, thereby representing the image with its representatives. Our approach possesses promising ad-hoc interpretability and demonstrates the emergence of segmentation despite being trained only on the classification task.
  • Figure 2: (a) Overall framework of FEC (§\ref{sec:method_fec}). Each stage $i$ contains $L^{i}$ clustering-based encode layers. (b) Illustration of our clustering-based feature pooling and encoding. (c) The basic elements during FEC's forward process are growing clusters instead of image patches.
  • Figure 3: Inspection of the modeled representatives (§\ref{sec:exp_ins_rep}) on ImageNet-1K \cite{ImageNet} val. Different colored masks indicate different clusters. As the number of clusters decreases, each cluster tends to represent an entire object or a portion of an object, suggesting that FEC effectively captures the underlying data distribution of visual scenes.
  • Figure 4: Quantitative results on ADE20K \cite{zhou2017scene} val for semantic segmentation (§\ref{sec:exp_sem_seg}). Semantic FPN \cite{kirillov2019panoptic} is adopted.
  • Figure 5: Quantitative results on COCO \cite{lin2014microsoft} val2017 for object detection and semantic segmentation (§\ref{sec:exp_det}). We use Mask R-CNN \cite{he2017mask} to evaluate the performance of the proposed backbone on two tasks.
  • ...and 1 more figure