RMAFF-PSN: A Residual Multi-Scale Attention Feature Fusion Photometric Stereo Network

Kai Luo, Yakun Ju, Lin Qi, Kaixuan Wang, Junyu Dong

TL;DR

This work tackles the difficulty of reconstructing accurate surface normals in photometric stereo for regions with intricate geometry and varying materials. It introduces RMAFF-PSN, a network that fuses multi-scale features from shallow high-resolution and deep low-resolution branches through a Residual Multi-scale Attention Feature Fusion module, employing channel and spatial attention to emphasize material-change and structure-rich areas. The model is trained with a cosine-similarity loss and uses max-pooling across illumination directions to address input-order ambiguity. Empirical results on DiLiGenT and additional real-world datasets demonstrate improved normal reconstruction, particularly for highly non-convex geometries and under sparse lighting, with robust ablation evidence supporting the effectiveness of the multi-scale attention fusion approach.
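
The two mechanisms named above, max-pooling across illumination directions and a cosine-similarity loss, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the array shapes, and the `1 - cos` form of the loss are assumptions chosen for illustration.

```python
import numpy as np

def max_pool_fusion(features):
    """Fuse per-light feature maps with an element-wise max.

    features: array of shape (num_lights, C, H, W) (assumed layout).
    Returns an array of shape (C, H, W). The max over axis 0 does not
    depend on the order of the lights.
    """
    return features.max(axis=0)

def cosine_similarity_loss(pred, target, eps=1e-8):
    """Mean of (1 - cosine similarity) over pixels.

    pred, target: normal maps of shape (3, H, W) (assumed layout).
    """
    pred_unit = pred / np.maximum(np.linalg.norm(pred, axis=0, keepdims=True), eps)
    target_unit = target / np.maximum(np.linalg.norm(target, axis=0, keepdims=True), eps)
    cos = (pred_unit * target_unit).sum(axis=0)
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(0)

# Six hypothetical per-light feature maps; shuffling them must not change the fusion.
feats = rng.normal(size=(6, 4, 5, 5))
fused = max_pool_fusion(feats)
fused_shuffled = max_pool_fusion(feats[rng.permutation(6)])
order_invariant = bool(np.array_equal(fused, fused_shuffled))

# The loss is close to 0 for identical normals and close to 2 for opposite normals.
normals = rng.normal(size=(3, 5, 5))
loss_same = cosine_similarity_loss(normals, normals)
loss_flipped = cosine_similarity_loss(normals, -normals)
```

Because the maximum is taken over the light axis, any permutation of the input images yields the same fused feature, which is the sense in which max-pooling removes the input-order ambiguity.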

Abstract

Predicting accurate normal maps from two-dimensional images with photometric stereo is challenging in regions of complex structure and spatially varying materials, because variations in object geometry and surface material change the surface reflectance. To address this issue, we propose a photometric stereo network called RMAFF-PSN that uses residual multi-scale attention feature fusion to handle the "difficult" regions of the object. Unlike previous approaches that only use stacked convolutional layers to extract deep features from the input image, our method integrates feature information from different resolution stages and scales of the image. This approach preserves more physical information, such as the texture and geometry of the object in complex regions, through shallow-deep stage feature extraction, double branching enhancement, and attention optimization. To test the network structure under real-world conditions, we propose a new real dataset called Simple PS data, which contains multiple objects with varying structures and materials. Experimental results on a publicly available benchmark dataset demonstrate that our method outperforms most existing calibrated photometric stereo methods for the same number of input images, especially for highly non-convex object structures. Our method also obtains good results under sparse lighting conditions.

Paper Structure (4 sections, 8 equations, 12 figures, 2 tables)

Figures (12)

  • Figure 1: Visualization of structurally complex areas using error maps. The number is the mean angular error (MAE) of the object. Green boxes mark material-change areas, yellow boxes mark additional shadows, red boxes mark complex areas, and magenta boxes mark diffuse reflections. The error maps show that our proposed method significantly improves the accuracy of the recovered normals in these areas.
  • Figure 2: Example images under different light directions. The red box illustrates a surface point with normal vector $\bm{n}$ that is illuminated by an infinitely distant point light source from direction $\bm{l}$ and observed by a camera along view direction $\bm{v}$. When $\bm{n}^{T} \bm{l_{j}}<0$, additional shadows occur; cast shadows appear when the light is occluded by the object.
  • Figure 3: RMAFF-PSN network architecture. The number beneath each layer is the number of channels used in that convolution.
  • Figure 4: Structure diagram of the RMAFF module. It uses residual-like blocks to expand the field of view while adaptively adding attention weights to the feature information.
  • Figure 5: Imaging setup for building the Simple PS data. We built a fabricated shelf and covered it with black cloth to simulate darkroom conditions. The camera is fixed at the top of the shelf. Six light sources were installed around the iron ring, and the target object was placed directly below the camera. The blue line shows the detailed device and location information, and the green line shows the height of the device from the ground.
  • ...and 7 more figures
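
Two quantities from the captions above, the mean angular error (MAE) reported in Figure 1 and the shadow condition $\bm{n}^{T} \bm{l_{j}}<0$ from Figure 2, can be computed as in the following sketch. The function names, the `(H, W, 3)` array layout, and the toy values are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def mean_angular_error_deg(pred, gt, eps=1e-8):
    """Mean angle in degrees between predicted and ground-truth normals.

    pred, gt: arrays of shape (H, W, 3) (assumed layout).
    """
    pred_unit = pred / np.maximum(np.linalg.norm(pred, axis=-1, keepdims=True), eps)
    gt_unit = gt / np.maximum(np.linalg.norm(gt, axis=-1, keepdims=True), eps)
    # Clip guards arccos against values slightly outside [-1, 1] due to rounding.
    cos = np.clip((pred_unit * gt_unit).sum(axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

def shadow_condition_mask(normals, light_dir):
    """True where n^T l < 0, i.e. the surface faces away from the light.

    normals: (H, W, 3); light_dir: (3,) unit vector.
    """
    return normals @ light_dir < 0

# Toy 2x2 normal map: ground truth faces the camera everywhere.
gt = np.zeros((2, 2, 3))
gt[..., 2] = 1.0

# The prediction is wrong by 90 degrees at a single pixel.
pred = gt.copy()
pred[0, 0] = [1.0, 0.0, 0.0]
mae = mean_angular_error_deg(pred, gt)  # (90 + 0 + 0 + 0) / 4 = 22.5 degrees

# A hypothetical light direction; only the pixel with normal (1, 0, 0) faces away from it.
light = np.array([-1.0, 0.0, 1.0]) / np.sqrt(2.0)
mask = shadow_condition_mask(pred, light)
```

The mask covers only the condition on the normal and the light direction; cast shadows depend on occlusion by other parts of the object and cannot be derived from a single normal.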