Beyond Viewpoint: Robust 3D Object Recognition under Arbitrary Views through Joint Multi-Part Representation

Linlong Fan, Ye Huang, Yanqi Ge, Wen Li, Lixin Duan

TL;DR

The paper tackles 3D object recognition under arbitrary views where object poses and the number of viewpoints vary and inputs are unaligned. It proposes PANet, a part-based representation that localizes discriminative parts in each view via weakly supervised cues, refines cross-view part information with an Adaptive Part Refinement transformer, and combines multiple global parts into a robust object descriptor. The approach introduces a cross-view association mechanism and part-aware loss to ensure diverse, informative part features, achieving state-of-the-art performance on ScanObjectNN, ModelNet, and RGBD, especially in arbitrary-view settings. These results demonstrate the practical impact of part-level, view-robust representations for real-world 3D recognition tasks and offer improved interpretability through part-level reasoning.

Abstract

Existing view-based methods excel at recognizing 3D objects from predefined viewpoints, but recognition under arbitrary views remains underexplored. This is a challenging and realistic setting because the positions and number of viewpoints differ from object to object, and object poses are not aligned. Most view-based methods, which aggregate multiple view features to obtain a global feature representation, struggle to address 3D object recognition under arbitrary views: because the inputs are unaligned, it is difficult to aggregate features robustly, which degrades performance. In this paper, we introduce a novel Part-aware Network (PANet), built on a part-based representation, to address these issues. This part-based representation aims to localize and understand different parts of 3D objects, such as airplane wings and tails. It has properties such as viewpoint invariance and rotation robustness, which give it an advantage in addressing the 3D object recognition problem under arbitrary views. Our results on benchmark datasets demonstrate that the proposed method outperforms existing view-based aggregation baselines for 3D object recognition under arbitrary views, and even surpasses most fixed-viewpoint methods.

Paper Structure (12 sections, 6 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: In practical scenarios, 3D objects are observed from arbitrary views (right). Under arbitrary views, each object is unaligned, and the positions and number of viewpoints vary from object to object. Previous works have focused on 3D object recognition in aligned views (left), where the pose of each object is aligned and the positions and number of viewpoints are predefined. Rotated views (middle) are an extension that introduces a random rotation for each 3D object while keeping the positions and number of viewpoints unchanged. Our work focuses on arbitrary views.
  • Figure 2: Comparison between multi-view aggregation representation (a) and part-based representation (b). View aggregation methods integrate features from different views and use the aggregated features to represent 3D objects. However, since the views are not aligned, the aggregated features lack robustness, making it difficult to handle 3D recognition under arbitrary views. Additionally, view aggregation methods suffer from information loss and therefore cannot fully utilize the features from each view. We propose a part-based representation that focuses on multi-view part awareness and combines multiple parts to robustly represent 3D objects.
  • Figure 3: Overview of our proposed Part-Aware Network (PANet). The network takes multi-view images as input, encodes them with a CNN to obtain view features $\textbf{I}$, and uses the cross-view association (CVA) module to enhance the view features. First, weakly supervised methods are employed to perceive part regions and generate view part sequences $\textbf{T}$. Then, given a set of learnable part tokens $\textbf{P}$, the sequences $\textbf{T}$ and $\textbf{P}$ are concatenated and fed to the adaptive part refinement (APR) module, which refines the sequence $\textbf{T}$ into a more compact representation, yielding the global part features. Finally, the multiple global parts $\underline{\textbf{P}}$ together serve as the representation of a 3D object.
  • Figure 4: Visualization of correlations between part features $\underline{\textbf{P}}$. The left figure illustrates the correlation between $\underline{\textbf{P}}$ after applying $\mathcal{L}_{awe}$.
  • Figure 5: Comparing the semantic differences between view-based aggregation methods and PANet in feature extraction.
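
The data flow described in the Figure 3 caption (concatenate learnable part tokens $\textbf{P}$ with the view part sequence $\textbf{T}$, refine them jointly, and read out the global parts $\underline{\textbf{P}}$) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the APR module is approximated here by a single-head self-attention pass with random weights, the final descriptor is assumed to be a plain concatenation of the global parts, and the correlation matrix is only a cosine-similarity stand-in for the kind of plot shown in Figure 4. All function names, shapes, and hyperparameters (`refine_parts`, `part_correlation`, `num_parts`, and so on) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def refine_parts(view_parts, part_tokens, w_q, w_k, w_v):
    """One self-attention pass over [P; T] (a stand-in for APR, assumed).

    view_parts:  (N, D) view part sequence T, pooled over all views
    part_tokens: (K, D) learnable part tokens P
    returns:     (K, D) refined global part features
    """
    k = part_tokens.shape[0]
    seq = np.concatenate([part_tokens, view_parts], axis=0)  # (K + N, D)
    q, key, v = seq @ w_q, seq @ w_k, seq @ w_v
    attn = softmax(q @ key.T / np.sqrt(q.shape[-1]), axis=-1)  # (K + N, K + N)
    out = seq + attn @ v  # residual connection
    return out[:k]  # keep only the positions of the part tokens


def part_correlation(parts):
    """Cosine similarity between every pair of part features, shape (K, K)."""
    unit = parts / np.linalg.norm(parts, axis=1, keepdims=True)
    return unit @ unit.T


# Toy sizes (hypothetical): 5 views, 4 part regions per view, 3 global parts.
rng = np.random.default_rng(0)
num_views, parts_per_view, dim, num_parts = 5, 4, 16, 3
T = rng.normal(size=(num_views * parts_per_view, dim))
P = rng.normal(size=(num_parts, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

global_parts = refine_parts(T, P, w_q, w_k, w_v)  # (num_parts, dim)
descriptor = global_parts.reshape(-1)             # assumed: concatenate parts
corr = part_correlation(global_parts)             # (num_parts, num_parts)
```

Two properties of this sketch match the arbitrary-view setting. The output shape depends only on the number of part tokens, so objects with different numbers of views yield descriptors of the same size. And because no positional encoding is used, reordering the rows of `T` leaves `global_parts` unchanged, so the result does not depend on the order in which views arrive.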