VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding

Hongyu Sun, Yongcai Wang, Peng Wang, Haoran Deng, Xudong Cai, Deying Li

TL;DR

A nimble Transformer model, named VSFormer, is devised to incorporate different views of a 3D shape into a permutation-invariant set, referred to as View Set, which removes rigid relation assumptions and facilitates adequate information exchange and fusion among views.

Abstract

View-based methods have demonstrated promising performance in 3D shape understanding. However, they tend to make strong assumptions about the relations between views or learn the multi-view correlations indirectly, which limits the flexibility of exploring inter-view correlations and the effectiveness of target tasks. To overcome these problems, this paper investigates flexible organization and explicit correlation learning for multiple views. In particular, we propose to incorporate different views of a 3D shape into a permutation-invariant set, referred to as View Set, which removes rigid relation assumptions and facilitates adequate information exchange and fusion among views. Based on that, we devise a nimble Transformer model, named VSFormer, to explicitly capture pairwise and higher-order correlations of all elements in the set. Meanwhile, we theoretically reveal a natural correspondence between the Cartesian product of a view set and the correlation matrix in the attention mechanism, which supports our model design. Comprehensive experiments suggest that VSFormer has better flexibility, efficient inference and superior performance. Notably, VSFormer reaches state-of-the-art results on various 3D recognition datasets, including ModelNet40, ScanObjectNN and RGBD. It also establishes new records on the SHREC'17 retrieval benchmark. The code and datasets are available at https://github.com/auniquesun/VSFormer.
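To make the permutation-invariance claim concrete, the following is a minimal sketch, not the authors' implementation. It assumes identity query/key/value projections, no positional encoding, and max-pooling over views as the final aggregation; the names `self_attention` and `shape_descriptor` and the toy feature values are hypothetical. Under these assumptions, reordering the views leaves the pooled descriptor unchanged.

```python
import math

def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(views):
    """Single-head self-attention with identity projections (an assumption).

    No positional encoding is added, so the output for each view depends
    only on the contents of the set, not on the order of its elements.
    """
    d = len(views[0])
    out = []
    for q in views:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in views])
        out.append([sum(w * v[c] for w, v in zip(weights, views))
                    for c in range(d)])
    return out

def shape_descriptor(views):
    """Max-pool the attended view features (hypothetical aggregation)."""
    attended = self_attention(views)
    d = len(views[0])
    return [max(row[c] for row in attended) for c in range(d)]

# Toy view set: 4 views with made-up 2-dimensional features.
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
shuffled = [views[2], views[0], views[3], views[1]]

a = shape_descriptor(views)
b = shape_descriptor(shuffled)
same = all(math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12)
           for x, y in zip(a, b))
print(same)
```

The comparison uses `math.isclose` rather than exact equality because summing the same numbers in a different order can change the last bits of a floating-point result.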

Paper Structure (39 sections, 1 theorem, 8 equations, 9 figures, 18 tables)

Key Result

Theorem 1

The Cartesian product $\mathcal{P}$ of a view set $\mathcal{V}$ can be formulated by a correlation matrix $\mathcal{A}$ and computed by the attention mechanism.
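One way to read Theorem 1 is that the attention score matrix holds exactly one entry for every ordered pair of views, that is, for every element of $\mathcal{V} \times \mathcal{V}$. The sketch below illustrates this under stated assumptions: it uses scaled dot products with identity query/key projections in place of learned ones, and the function name `correlation_matrix` and the toy features are hypothetical rather than taken from the paper.

```python
import itertools
import math

def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def correlation_matrix(views):
    """Row-wise softmax of scaled dot products between all view features.

    Identity projections are an assumption; a trained model would apply
    learned query and key projections before taking the dot products.
    """
    d = len(views[0])
    scores = [[dot(vi, vj) / math.sqrt(d) for vj in views] for vi in views]
    return [softmax(row) for row in scores]

# Toy view set: 3 views with made-up 2-dimensional features.
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Cartesian product of the view indices with themselves: all ordered pairs.
pairs = list(itertools.product(range(len(views)), repeat=2))

A = correlation_matrix(views)

# Each ordered pair (i, j) in the product maps to exactly one entry A[i][j].
entries = {(i, j): A[i][j] for (i, j) in pairs}
print(len(pairs), len(entries))
```

For a set of $N$ views the product has $N^2$ ordered pairs, which matches the $N \times N$ shape of the attention matrix; each row is normalized so that the correlations of one view with all views sum to 1.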

Figures (9)

  • Figure 1: A division of multi-view 3D shape analysis methods, based on how they organize views and aggregate multi-view information. VSFormer adopts View Set, in which the views of a 3D shape are organized as a set.
  • Figure 2: The overall architecture of VSFormer. It consists of 4 modules: Initializer (Init), Encoder, Transition (Transit) and Decoder. Encoder is responsible for grasping pairwise and higher-order correlations of views in a set.
  • Figure 3: Visualization, in colored lines, of the multi-view attention among 8 views of a nightstand.
  • Figure 4: Visualization of the attention scores for 8 views of a 3D airplane.
  • Figure 5: Visualization of the 3D shape feature distribution on (a) ScanObjectNN (SONN) with 15 classes, (b) ModelNet40 (MN40) with 40 classes, and (c) RGBD with 51 classes.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Theorem 1: Correspondence between the Cartesian product of a view set and the correlation matrix in the attention mechanism