Multi-view Deep Subspace Clustering Networks

Pengfei Zhu, Xinjie Yao, Yu Wang, Binyuan Hui, Dawei Du, Qinghua Hu

TL;DR

This work proposes a multi-view deep subspace clustering network (MvDSCN) that learns a multi-view self-representation matrix in an end-to-end manner, and demonstrates the superiority of the proposed multi-view subspace clustering model on both multi-feature and multi-modality learning.

Abstract

Multi-view subspace clustering aims to discover the inherent structure of data by fusing multiple views of complementary information. Most existing methods first extract multiple types of handcrafted features and then learn a joint affinity matrix for clustering. This approach has two disadvantages: 1) multi-view relations are not embedded into feature learning, and 2) the end-to-end learning manner of deep learning has not been exploited for multi-view clustering. Even when deep features are used, choosing a proper backbone for clustering on different datasets is a nontrivial problem. To address these issues, we propose the Multi-view Deep Subspace Clustering Networks (MvDSCN), which learns a multi-view self-representation matrix in an end-to-end manner. The MvDSCN consists of two sub-networks, i.e., a diversity network (Dnet) and a universality network (Unet). A latent space is built using deep convolutional autoencoders, and a self-representation matrix is learned in the latent space using a fully connected layer. Dnet learns view-specific self-representation matrices, whereas Unet learns a common self-representation matrix for all views. To exploit the complementarity of multi-view representations, the Hilbert-Schmidt independence criterion (HSIC) is introduced as a diversity regularizer that captures the nonlinear, high-order inter-view relations. Because different views share the same label space, the self-representation matrix of each view is aligned to the common one by universality regularization. The MvDSCN also unifies multiple backbones to boost clustering performance and avoid the need for model selection. Experiments on multi-feature and multi-modality learning demonstrate the superiority of the MvDSCN.
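The abstract describes the training objective only in words. One plausible way to write its self-representation, diversity, and universality terms is sketched here in NumPy as a non-authoritative illustration, not the paper's exact loss. Assumptions not taken from the paper: latent codes are d x n matrices with one sample per column, HSIC uses linear (inner-product) kernels K = C^T C with the standard empirical estimator (n-1)^-2 tr(K1 H K2 H), the weights `lam_div`, `lam_uni`, `lam_reg` and all function names are hypothetical, and the autoencoder reconstruction losses and any zero-diagonal constraint on C are omitted.

```python
import numpy as np


def self_expression_loss(z, c):
    """||Z - Z C||_F^2 for latent codes Z (d x n, one sample per column)
    and a self-representation matrix C (n x n)."""
    return float(np.sum((z - z @ c) ** 2))


def hsic(c1, c2):
    """Empirical HSIC, (n-1)^-2 tr(K1 H K2 H), between two self-representation
    matrices. ASSUMPTION: linear kernels K = C^T C over the n samples (columns)."""
    n = c1.shape[1]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix H
    k1 = c1.T @ c1
    k2 = c2.T @ c2
    return float(np.trace(k1 @ h @ k2 @ h)) / (n - 1) ** 2


def mvdscn_style_objective(z_div, z_uni, c_views, c_common,
                           lam_div=1.0, lam_uni=1.0, lam_reg=1.0):
    """HYPOTHETICAL combined objective (terms and weights are assumptions):
    - Dnet: view-specific self-expression, one C^v per view
    - Unet: self-expression of every view's latent codes with a common C
    - diversity: HSIC between every pair of view-specific C^v
    - universality: ||C^v - C||_F^2 aligning each C^v to the common C
    - Frobenius penalty on all self-representation matrices
    """
    loss = 0.0
    for z, c in zip(z_div, c_views):
        loss += self_expression_loss(z, c)
    for z in z_uni:
        loss += self_expression_loss(z, c_common)
    v = len(c_views)
    for i in range(v):
        for j in range(i + 1, v):
            loss += lam_div * hsic(c_views[i], c_views[j])
    for c in c_views:
        loss += lam_uni * float(np.sum((c - c_common) ** 2))
    all_c = list(c_views) + [c_common]
    loss += lam_reg * sum(float(np.sum(c ** 2)) for c in all_c)
    return loss
```

Minimizing the HSIC term pushes the view-specific matrices toward mutual independence (diversity), while the universality term pulls each of them toward the common matrix. In the network the abstract describes, each C would be the weights of a fully connected self-representation layer trained by gradient descent rather than a fixed array. Note that C = I makes the self-expression term vanish trivially, which is why self-expressive models usually add a norm penalty or constrain diag(C) = 0.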

Paper Structure

This paper contains 20 sections, 17 equations, 9 figures, 7 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Examples of multi-view learning. A sample can be represented by different modalities, such as images, video, and text. Different kinds of backbones, e.g., VGG, ResNet, and DenseNet, can be used to obtain multi-view feature representation.
  • Figure 2: Architecture of the MvDSCN, which consists of two parts: the Dnet, which learns view-specific representations using different autoencoders and independent self-representation layers, and the Unet, which learns view-consistent representations using different autoencoders and a common self-representation layer.
  • Figure 3: Visualization of the affinity matrices of different views. The first three columns are the affinity matrices of view 1, view 2, and view 3 learned by the DSCN (ji2017deep). The last column is the affinity matrix obtained by the MvDSCN for all views. The top row is the result on the Yale dataset, and the bottom row is the result on the ORL dataset.
  • Figure 4: MvDSCN with different pre-trained backbone models.
  • Figure 5: Samples from the RGB-D Object dataset: RGB images (left) and the corresponding depth images processed with a recursive median filter (right).
  • ...and 4 more figures
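Figure 3 shows affinity matrices derived from learned self-representation coefficients. A common recipe in self-expressive subspace clustering is to symmetrize |C| into an affinity matrix and run spectral clustering on it. This sketch assumes exactly that recipe, without any extra post-processing the paper may apply, and uses a bare-bones normalized spectral embedding plus a deterministic k-means so that it needs only NumPy. All function names are hypothetical.

```python
import numpy as np


def affinity_from_coefficients(c):
    """Symmetric, nonnegative affinity A = (|C| + |C|^T) / 2.
    ASSUMPTION: a common choice; the paper's post-processing may differ."""
    a = np.abs(c)
    return (a + a.T) / 2.0


def spectral_embedding(a, k):
    """Row-normalized k leading eigenvectors of D^-1/2 A D^-1/2."""
    d = a.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    m = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(m)  # eigenvalues in ascending order
    u = vecs[:, -k:]
    return u / np.maximum(np.linalg.norm(u, axis=1, keepdims=True), 1e-12)


def kmeans(x, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [x[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        sq = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(sq, axis=1)
        new = np.array([x[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels


def cluster_from_coefficients(c, k):
    """Self-representation matrix C -> affinity -> spectral clustering labels."""
    return kmeans(spectral_embedding(affinity_from_coefficients(c), k), k)
```

If scikit-learn is available, `SpectralClustering(n_clusters=k, affinity="precomputed").fit_predict(A)` can replace the last two helpers. A block-diagonal C, as in the MvDSCN column of Figure 3, yields one cluster per block.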

Theorems & Definitions (1)

  • Definition 1