Dynamic Graph Neural Network with Adaptive Features Selection for RGB-D Based Indoor Scene Recognition

Qiong Liu, Ruofei Xiong, Xingzhen Chen, Muyao Peng, You Yang

Abstract

Multi-modality of color and depth, i.e., RGB-D, is of great importance in recent research on indoor scene recognition. In this data representation, the depth map describes the 3D structure of the scene and the geometric relations among objects. Previous works have shown that local features from both modalities are vital for improving recognition accuracy. However, the problem of adaptively selecting and effectively exploiting these key local features remains open in this field. In this paper, a dynamic graph model with an adaptive node selection mechanism is proposed to solve this problem. In this model, a dynamic graph is built to model the relations among objects and the scene, and an adaptive node selection method is proposed to extract key local features from both the RGB and depth modalities for graph modeling. These nodes are then grouped into three levels, representing near and far relations among objects. Moreover, the graph model is updated dynamically according to attention weights. Finally, the updated and optimized features of the RGB and depth modalities are fused for indoor scene recognition. Experiments are performed on the public SUN RGB-D and NYU Depth v2 datasets. Extensive results demonstrate that our method outperforms state-of-the-art methods and show that it exploits crucial local features from both the RGB and depth modalities.
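The adaptive node selection the abstract describes can be sketched as a top-$k$ pick over an attention map computed on a CNN feature map. The snippet below is a minimal illustration, not the paper's implementation; the function name, shapes, and the assumption of a per-location attention map of the same spatial size as the feature map are ours.

```python
import numpy as np

def select_nodes(feature_map, attention_map, k=16):
    """Pick the k local features with the highest attention weight.

    feature_map:   (C, H, W) CNN features from one modality (RGB or depth).
    attention_map: (H, W) per-location importance scores.
    Returns the selected features (k, C) and their (row, col) coordinates,
    both sorted by descending attention weight.
    """
    C, H, W = feature_map.shape
    flat = attention_map.reshape(-1)
    top = np.argsort(flat)[::-1][:k]                          # k largest weights
    coords = np.stack(np.unravel_index(top, (H, W)), axis=1)  # (k, 2) positions
    feats = feature_map.reshape(C, -1)[:, top].T              # (k, C) features
    return feats, coords
```

The same routine would be applied independently to the RGB and depth streams before graph construction.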

Paper Structure

This paper contains 24 sections, 10 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: The framework of the proposed dynamic graph. First, the RGB and depth images are fed into two CNNs for feature extraction. Then, the adaptive node selection (ANS) module selects nodes to construct the multi-modality dynamic graph. Finally, the global and local features learned by the dynamic graph model are combined for the final scene recognition.
  • Figure 2: The selected key local features (i.e., nodes) for single-modality (e.g., RGB) graph construction. Each node contributes differently to the task of scene recognition, and this contribution should be evaluated properly in graph modeling. In our dynamic graph model, the importance of each node is represented by its value in the computed attention map. Taking $k$ = 16 as an example, the nodes are organized into three levels: 1 main-central node (in red), 3 sub-central nodes (in green), and 12 leaf nodes (in blue). The three sub-central nodes {$R_2$, $R_3$, $R_4$} are connected to the main-central node $R_1$, and each remaining leaf node is connected to its nearest sub-central node by Euclidean distance. A similar process is performed for the depth dynamic graph construction.
  • Figure 3: Connection between the two modalities. Considering the semantic gap between the two modalities, a sparse connection is chosen, which means the main-central nodes $R_1$ and $D_1$ and the sub-central nodes {$R_2$, $R_3$, $R_4$} and {$D_2$, $D_3$, $D_4$} are connected separately.
  • Figure 4: The classification confusion matrix of the proposed dynamic graph model on the SUN RGB-D Dataset. The vertical axis shows the ground-truth classes, the horizontal axis shows the predicted classes. The classes on the horizontal axis are in the same order as those on the vertical axis.
  • Figure 5: The classification confusion matrix of the proposed dynamic graph model on the NYU Depth v2 Dataset. The vertical axis shows the ground-truth classes, and the horizontal axis shows the predicted classes.
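The three-level grouping of Figure 2 and the sparse cross-modal links of Figure 3 can be sketched as adjacency-matrix construction. This is a hypothetical reconstruction, not the paper's code: it assumes the $k$ selected nodes arrive sorted by descending attention weight, and it reads "connected separately" in Figure 3 as pairing the $i$-th central node of each modality ($R_1$–$D_1$, $R_2$–$D_2$, etc.).

```python
import numpy as np

def build_graph(coords):
    """Group k nodes (sorted by attention) into three levels, as in Figure 2:
    node 0 is main-central, nodes 1-3 are sub-central (linked to node 0), and
    each remaining leaf links to its nearest sub-central node by Euclidean
    distance between spatial coordinates. Returns a symmetric adjacency matrix.
    """
    k = len(coords)
    adj = np.zeros((k, k), dtype=int)
    for s in (1, 2, 3):                            # sub-central -> main-central
        adj[0, s] = adj[s, 0] = 1
    for leaf in range(4, k):
        d = np.linalg.norm(coords[leaf] - coords[1:4], axis=1)
        s = 1 + int(np.argmin(d))                  # nearest sub-central node
        adj[leaf, s] = adj[s, leaf] = 1
    return adj

def connect_modalities(adj_rgb, adj_depth):
    """Sparse cross-modal links as in Figure 3: only the main-central and
    sub-central node pairs of the two graphs are joined."""
    k = adj_rgb.shape[0]
    adj = np.zeros((2 * k, 2 * k), dtype=int)
    adj[:k, :k] = adj_rgb                          # RGB block
    adj[k:, k:] = adj_depth                        # depth block
    for i in range(4):                             # main + three sub-central pairs
        adj[i, k + i] = adj[k + i, i] = 1
    return adj
```

Leaf nodes are deliberately left unconnected across modalities, which keeps the joint graph sparse and limits message passing across the semantic gap to the most important nodes.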