MVTN: Learning Multi-View Transformations for 3D Understanding

Abdullah Hamdi, Faisal AlZahrani, Silvio Giancola, Bernard Ghanem

TL;DR

Fixed view-points in multi-view 3D recognition limit performance. The Multi-View Transformation Network (MVTN) regresses per-shape view-points and renders the corresponding views with a differentiable renderer, which enables end-to-end training with any multi-view classifier and supports both meshes and point clouds. MVTN achieves state-of-the-art retrieval on ShapeNet Core55 and ModelNet40, strong classification on ScanObjectNN, and competitive classification results on ModelNet40, while improving robustness to rotation and occlusion and extending to 3D segmentation. The authors also release MVTorch, a PyTorch library for multi-view 3D understanding, to facilitate research. The work positions learnable view selection as a practical component that can be attached to existing multi-view pipelines.

Abstract

Multi-view projection techniques have shown themselves to be highly effective in achieving top-performing results in the recognition of 3D shapes. These methods involve learning how to combine information from multiple view-points. However, the camera view-points from which these views are obtained are often fixed for all shapes. To overcome the static nature of current multi-view techniques, we propose learning these view-points. Specifically, we introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition. As a result, MVTN can be trained end-to-end with any multi-view network for 3D shape classification. We integrate MVTN into a novel adaptive multi-view pipeline that is capable of rendering both 3D meshes and point clouds. Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks (ModelNet40, ScanObjectNN, ShapeNet Core55). Further analysis indicates that our approach exhibits improved robustness to occlusion compared to other methods. We also investigate additional aspects of MVTN, such as 2D pretraining and its use for segmentation. To support further research in this area, we have released MVTorch, a PyTorch library for 3D understanding and generation using multi-view projections.
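
As a rough illustration of what "learning how to combine information from multiple view-points" can mean in practice, the sketch below applies element-wise max pooling across per-view feature vectors, a common aggregation in MVCNN-style classifiers. This is an assumption chosen for illustration: MVTN is described as working with any multi-view network, so it is not tied to this aggregation, and the function name `max_view_pool` and the feature values are hypothetical.

```python
def max_view_pool(view_features):
    """Element-wise max over per-view feature vectors.

    view_features: list of equal-length lists, one feature vector per view.
    Returns a single shape descriptor of the same length.
    """
    if not view_features:
        raise ValueError("need at least one view")
    # zip(*...) groups the i-th feature of every view together.
    return [max(values) for values in zip(*view_features)]


# Hypothetical features from three rendered views of one shape.
features = [
    [0.1, 0.9, 0.3],
    [0.4, 0.2, 0.8],
    [0.0, 0.5, 0.6],
]
descriptor = max_view_pool(features)
```

Because the maximum is taken per feature, the resulting descriptor does not depend on the order in which the views are listed.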

Paper Structure

This paper contains 38 sections, 2 equations, 25 figures, 17 tables.

Figures (25)

  • Figure 1: Multi-View Transformation Network (MVTN). We propose a differentiable module that predicts the best view-points for a task-specific multi-view network. MVTN is trained jointly with this network without any extra training supervision, while improving the performance on 3D classification and shape retrieval.
  • Figure 2: End-to-End Learning Pipeline for Multi-View Recognition. To learn adaptive scene parameters $\mathbf{u}$ that maximize the performance of a multi-view network $\mathbf{C}$ for every 3D object shape $\mathbf{S}$, we use a differentiable renderer $\mathbf{R}$. MVTN extracts coarse features from $\mathbf{S}$ by a point encoder and regresses the adaptive scene parameters for that object. In this example, the parameters $\mathbf{u}$ are the azimuth and elevation angles of cameras pointing towards the center of the object. The MVTN pipeline is optimized end-to-end for the task loss.
  • Figure 3: Multi-View Camera Configurations: The view setups commonly used in the multi-view literature are circular (MVCNN) or spherical (ViewGCN, RotationNet). Our MVTN learns to predict specific view-points for each object shape at inference time. The shape's center is shown as a red dot, and the view-points as blue cameras with their mesh renderings shown at the bottom.
  • Figure 4: Multi-View Point Cloud Renderings. We show some examples of point cloud renderings used in our pipeline. Note how point cloud renderings offer more information about content hidden from the camera view-point (e.g. car wheels from the occluded side), which can be useful for recognition.
  • Figure 5: Qualitative Examples for Object Retrieval. Left: query objects from the test set. Right: the top five objects retrieved by our MVTN from the training set. Images of negative (incorrectly retrieved) objects are framed.
  • ...and 20 more figures
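
The captions for Figures 2 and 3 describe cameras parameterized by azimuth and elevation angles and pointed at the object's center, with a fixed circular setup as one common baseline. The sketch below shows one way such a parameterization could be turned into camera positions. It is a minimal illustration under stated assumptions, not the authors' implementation: the y-up axis convention, the camera distance of 2.2, the 30-degree elevation, and the helper names are all hypothetical, and the per-shape adjustment is modeled as plain additive offsets supplied by hand rather than regressed by a network.

```python
import math


def camera_position(azimuth_deg, elevation_deg, distance=2.2):
    """Place a camera on a sphere around the origin (assumed y-up convention)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = distance * math.cos(el) * math.sin(az)
    y = distance * math.sin(el)
    z = distance * math.cos(el) * math.cos(az)
    return (x, y, z)


def circular_views(num_views=12, elevation_deg=30.0):
    """Fixed circular setup: evenly spaced azimuths at one shared elevation."""
    step = 360.0 / num_views
    return [(i * step, elevation_deg) for i in range(num_views)]


def apply_offsets(views, offsets):
    """Shift each (azimuth, elevation) pair by a per-shape offset.

    Assumption: in a learned pipeline these offsets would be predicted
    per shape; here they are ordinary inputs.
    """
    shifted = []
    for (az, el), (d_az, d_el) in zip(views, offsets):
        new_el = max(-89.0, min(89.0, el + d_el))  # stay away from the poles
        shifted.append(((az + d_az) % 360.0, new_el))
    return shifted


fixed = circular_views(num_views=4, elevation_deg=30.0)
adapted = apply_offsets(fixed, [(10.0, -5.0)] * 4)
cameras = [camera_position(az, el) for az, el in adapted]
```

Every camera stays at the same distance from the origin, so changing the two angles moves the view-point over a sphere around the shape while it keeps looking at the center.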