Deep Models for Multi-View 3D Object Recognition: A Review

Mona Alzahrani, Muhammad Usman, Salma Kammoun, Saeed Anwar, Tarek Helmy

TL;DR

This paper surveys deep learning and transformer-based approaches to multi-view 3D object recognition, in which multiple views are rendered around a 3D object and the per-view features are fused, an approach that yields state-of-the-art performance on classification and retrieval. It catalogs datasets (e.g., ModelNet40/10, ShapeNet Core55), camera configurations (circular, spherical, etc.), view-selection strategies, backbones, and fusion schemes, and it compares leading methods such as MVCNN, RotationNet, OVPT, MVMSAN, MVT, and ViewFormer. The review also highlights transformer-based architectures that enable cross-view attention and patch-level interactions, with results competitive with or superior to those of traditional view-based CNNs. Finally, it identifies key factors affecting performance (view count, backbone tuning, fusion strategy, lighting/color robustness, and transformer depth) and suggests directions for improving generalization and efficiency in future multi-view 3D recognition systems.
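As a rough illustration of the "render multiple views, then fuse per-view features" pipeline summarized above, the following minimal sketch places cameras in a circular configuration and fuses per-view feature vectors with an element-wise max, the view-pooling choice used by MVCNN. The function names `circular_cameras` and `max_pool_views`, the default of 12 views at 30° elevation, and the toy feature values are illustrative assumptions, not taken from the paper; a real system would obtain the features from a CNN or transformer backbone and might use a different fusion scheme.

```python
import math


def circular_cameras(num_views=12, elevation_deg=30.0, radius=1.0):
    """Place cameras evenly on a circle around an object at the origin.

    Hypothetical helper: the defaults are an assumed setup, not a
    prescription from the review.
    """
    elev = math.radians(elevation_deg)
    cameras = []
    for i in range(num_views):
        azim = 2.0 * math.pi * i / num_views  # evenly spaced azimuth angles
        cameras.append((
            radius * math.cos(elev) * math.cos(azim),
            radius * math.cos(elev) * math.sin(azim),
            radius * math.sin(elev),
        ))
    return cameras


def max_pool_views(view_features):
    """Fuse per-view feature vectors into one descriptor (element-wise max)."""
    if not view_features:
        raise ValueError("need at least one view")
    return [max(column) for column in zip(*view_features)]


# Toy per-view features (made-up numbers standing in for backbone outputs).
features = [
    [0.1, 0.9, 0.3],
    [0.7, 0.2, 0.4],
    [0.5, 0.6, 0.8],
]
fused = max_pool_views(features)  # one descriptor for the whole object
```

Max pooling is only one option here; it is order-invariant across views, which is convenient when the views have no canonical ordering, but it discards how the views relate to one another, which attention-based fusion tries to retain.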

Abstract

Human decision-making often relies on visual information from multiple perspectives or views. In contrast, machine learning-based object recognition utilizes information from a single image of the object. However, the information conveyed by a single image may not be sufficient for accurate decision-making, particularly in complex recognition problems. The utilization of multi-view 3D representations for object recognition has thus far demonstrated the most promising results for achieving state-of-the-art performance. This review paper comprehensively covers recent progress in multi-view 3D object recognition methods for 3D classification and retrieval tasks. Specifically, we focus on deep learning-based and transformer-based techniques, as they are widely utilized and have achieved state-of-the-art performance. We provide detailed information about existing deep learning-based and transformer-based multi-view 3D object recognition models, including the most commonly used 3D datasets, camera configurations and number of views, view selection strategies, pre-trained CNN architectures, fusion strategies, and recognition performance on 3D classification and 3D retrieval tasks. Additionally, we examine various computer vision applications that use multi-view classification. Finally, we highlight key findings and future directions for developing multi-view 3D object recognition methods to provide readers with a comprehensive understanding of the field.

Paper Structure (30 sections, 21 figures, 3 tables)

This paper contains 30 sections, 21 figures, 3 tables.

Figures (21)

  • Figure 1: Various representations of the 3D data.
  • Figure 2: 3D object recognition tasks: (a) 3D object classification, and (b) 3D object retrieval.
  • Figure 3: Example of multi-view 3D object representation.
  • Figure 4: The timeline shows the period covered by the existing 3D object recognition survey and the most relevant DL-based and transformer-based multi-view 3D object recognition methods developed in recent years. The timeline runs from 2015 (when the first multi-view 3D object recognition method was developed) to the present.
  • Figure 5: Example of multiple views of the same object in the computer vision field.
  • ...and 16 more figures