RMAFF-PSN: A Residual Multi-Scale Attention Feature Fusion Photometric Stereo Network
Kai Luo, Yakun Ju, Lin Qi, Kaixuan Wang, Junyu Dong
TL;DR
This work tackles the difficulty of reconstructing accurate surface normals in photometric stereo for regions with intricate geometry and varying materials. It introduces RMAFF-PSN, a network that fuses multi-scale features from shallow high-resolution and deep low-resolution branches through a Residual Multi-scale Attention Feature Fusion module, which uses channel and spatial attention to emphasize material-change and structure-rich areas. The model is trained with a cosine-similarity loss and applies max-pooling across illumination directions, so the fused features do not depend on the order of the input images. Results on DiLiGenT and additional real-world datasets show improved normal reconstruction, particularly for highly non-convex geometries and under sparse lighting, and ablation studies support the multi-scale attention fusion design.
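The two training ingredients named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the tensor shapes, and the exact form of the loss (mean of 1 minus the cosine between predicted and ground-truth normals, a common choice) are assumptions made for illustration.

```python
import numpy as np

def max_pool_fusion(features):
    """Fuse per-light features of shape (num_lights, C, H, W) into (C, H, W).

    The element-wise maximum over the light axis does not depend on the
    order in which the images were supplied.
    """
    return features.max(axis=0)

def cosine_similarity_loss(pred, target, eps=1e-8):
    """Mean of 1 - cos(angle) between two normal maps of shape (3, H, W).

    Assumed form of the loss; the paper may define it differently.
    """
    dot = (pred * target).sum(axis=0)
    norm = np.linalg.norm(pred, axis=0) * np.linalg.norm(target, axis=0)
    return float(np.mean(1.0 - dot / (norm + eps)))

rng = np.random.default_rng(0)

# Hypothetical features for 8 lighting directions, 16 channels, 4x4 pixels.
feats = rng.standard_normal((8, 16, 4, 4))
fused = max_pool_fusion(feats)
shuffled = max_pool_fusion(feats[rng.permutation(8)])
order_invariant = np.array_equal(fused, shuffled)

# The loss is near 0 for identical normals and near 2 for opposite normals.
normals = rng.standard_normal((3, 4, 4))
loss_same = cosine_similarity_loss(normals, normals)
loss_opposite = cosine_similarity_loss(normals, -normals)
```

Shuffling the light axis leaves the fused features unchanged, which is the property that removes the input-order ambiguity.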
Abstract
Predicting accurate normal maps from two-dimensional images with photometric stereo is challenging in regions of complex structure and spatially varying materials, because surface reflectance changes with both object geometry and surface material. To address this issue, we propose RMAFF-PSN, a photometric stereo network that uses residual multi-scale attention feature fusion to handle these "difficult" regions of the object. Unlike previous approaches that only stack convolutional layers to extract deep features from the input image, our method integrates feature information from different resolution stages and scales of the image. Through shallow-deep stage feature extraction, double-branch enhancement, and attention optimization, it preserves more physical information, such as the texture and geometry of the object in complex regions. To test the network under real-world conditions, we propose a new real dataset, Simple PS data, which contains multiple objects with varying structures and materials. Experimental results on a publicly available benchmark dataset show that our method outperforms most existing calibrated photometric stereo methods for the same number of input images, especially for highly non-convex object structures. Our method also obtains good results under sparse lighting conditions.
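As a rough illustration of what fusing shallow high-resolution and deep low-resolution features with attention and a residual connection can look like, consider the toy NumPy sketch below. It is an assumption-laden simplification, not the RMAFF module itself: the gates have no learned weights, and the function names, shapes, and the particular way the branches are combined are all hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x):
    """Per-channel gate in (0, 1) from global average pooling (no learned weights)."""
    return sigmoid(x.mean(axis=(1, 2), keepdims=True))

def spatial_attention(x):
    """Per-pixel gate in (0, 1) from the channel-wise mean (no learned weights)."""
    return sigmoid(x.mean(axis=0, keepdims=True))

def fuse(shallow, deep):
    """Toy residual fusion of a high-resolution and a low-resolution feature map."""
    factor = shallow.shape[1] // deep.shape[1]
    merged = shallow + upsample_nearest(deep, factor)
    gated = merged * channel_attention(merged) * spatial_attention(merged)
    # Residual connection: the high-resolution features pass through unchanged.
    return shallow + gated

rng = np.random.default_rng(0)
shallow = rng.standard_normal((8, 16, 16))  # hypothetical shallow branch output
deep = rng.standard_normal((8, 8, 8))       # hypothetical deep branch output
out = fuse(shallow, deep)
```

The output keeps the resolution of the shallow branch, so fine texture and geometry cues are not lost to downsampling, while the gates reweight channels and pixels of the merged features.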
