VSFormer: Mining Correlations in Flexible View Set for Multi-view 3D Shape Understanding

Hongyu Sun; Yongcai Wang; Peng Wang; Haoran Deng; Xudong Cai; Deying Li

TL;DR

A nimble Transformer model, named VSFormer, is devised to incorporate different views of a 3D shape into a permutation-invariant set, referred to as View Set, which removes rigid relation assumptions and facilitates adequate information exchange and fusion among views.

Abstract

View-based methods have demonstrated promising performance in 3D shape understanding. However, they tend to make strong assumptions about the relations between views or learn the multi-view correlations indirectly, which limits both the flexibility of exploring inter-view correlations and the effectiveness on target tasks. To overcome these problems, this paper investigates flexible organization and explicit correlation learning for multiple views. In particular, we propose to incorporate different views of a 3D shape into a permutation-invariant set, referred to as View Set, which removes rigid relation assumptions and facilitates adequate information exchange and fusion among views. Based on that, we devise a nimble Transformer model, named VSFormer, to explicitly capture pairwise and higher-order correlations of all elements in the set. Meanwhile, we theoretically reveal a natural correspondence between the Cartesian product of a view set and the correlation matrix in the attention mechanism, which supports our model design. Comprehensive experiments suggest that VSFormer has better flexibility, efficient inference and superior performance. Notably, VSFormer reaches state-of-the-art results on various 3D recognition datasets, including ModelNet40, ScanObjectNN and RGBD. It also establishes new records on the SHREC'17 retrieval benchmark. The code and datasets are available at https://github.com/auniquesun/VSFormer.
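
To make the permutation-invariance claim concrete, the following minimal sketch checks that self-attention without positional encodings, followed by max pooling, gives the same shape descriptor regardless of view order. It is not the authors' implementation: the helper names `self_attention` and `shape_descriptor`, the absence of learned projections, and the choice of max pooling are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(views):
    """Dot-product self-attention with no positional encoding.

    Assumption: queries, keys and values are the raw view features
    (no learned projection weights), to keep the sketch short.
    """
    d = len(views[0])
    attended = []
    for q in views:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in views]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, views)) for i in range(d)]
        )
    return attended


def shape_descriptor(views):
    """Max-pool the attended view features into one vector (hypothetical helper)."""
    attended = self_attention(views)
    return [max(column) for column in zip(*attended)]


random.seed(0)
views = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # 6 views, 4-dim
shuffled = views[:]
random.shuffle(shuffled)

a = shape_descriptor(views)
b = shape_descriptor(shuffled)
same = all(math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12) for x, y in zip(a, b))
print("descriptor unchanged under view permutation:", same)
```

The per-view outputs of attention are permuted together with the inputs, and the symmetric pooling step then removes the dependence on order. Adding positional encodings to the views would break this property, which is why the sketch omits them.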

Paper Structure (39 sections, 1 theorem, 8 equations, 9 figures, 18 tables)

Introduction
Related Work
Multi-view 3D Shape Analysis
Independent Views
View Sequence
View Graph
View Set
Set in Multi-view 3D Shape Analysis
Attention in Multi-view 3D Shape Analysis
Latest Progress in Multi-view 3D Shape Analysis
Methodology
Problem Formulation
View Set
3D Shape Recognition & Retrieval
View Set Attention Model
...and 24 more sections

Key Result

Theorem 1

The Cartesian product $\mathcal{P}$ of a view set $\mathcal{V}$ can be formulated by a correlation matrix $\mathcal{A}$ and computed by the attention mechanism.
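
As a rough illustration of the correspondence stated in Theorem 1, the sketch below enumerates the ordered pairs of a small view set and shows that each pair indexes exactly one entry of an attention matrix. The paper's exact formulation is not reproduced on this page, so the toy features, the helper `attention_matrix`, and the use of unprojected dot-product attention are assumptions for illustration only.

```python
import itertools
import math


def attention_matrix(views):
    """Row-wise softmax of scaled dot products between all pairs of views.

    Assumption: no learned query/key projections; raw features are used.
    """
    d = len(views[0])
    matrix = []
    for q in views:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in views]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


# Toy view set with 3 views and 2-dimensional features (made-up values).
views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Cartesian product of the view set with itself, as ordered index pairs (i, j).
pairs = list(itertools.product(range(len(views)), repeat=2))

A = attention_matrix(views)

# Each ordered pair (v_i, v_j) corresponds to exactly one entry A[i][j].
correlation = {(i, j): A[i][j] for i, j in pairs}

for (i, j), score in correlation.items():
    print(f"(v{i}, v{j}) -> {score:.3f}")
```

For a set of N views the product contains N * N ordered pairs, matching the N x N shape of the attention matrix, so a single attention pass scores every pair at once.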

Figures (9)

Figure 1: A division of multi-view 3D shape analysis methods, based on how they organize views and aggregate multi-view information. VSFormer adopts the View Set organization, in which the views of a 3D shape are organized as a set.
Figure 2: The overall architecture of VSFormer. It consists of 4 modules: Initializer (Init), Encoder, Transition (Transit) and Decoder. Encoder is responsible for grasping pairwise and higher-order correlations of views in a set.
Figure 3: Visualization of multi-view attention among 8 views of a nightstand, drawn as colored lines.
Figure 4: Visualization of the attention scores for 8 views of a 3D airplane.
Figure 5: Visualization of 3D shape feature distributions on (a) ScanObjectNN (SONN) with 15 classes, (b) ModelNet40 (MN40) with 40 classes, and (c) RGBD with 51 classes.
...and 4 more figures

Theorems & Definitions (1)

Theorem 1: Correspondence between the Cartesian product of a view set and the correlation matrix in the attention mechanism
