Mesh Denoising Transformer
Wenbo Zhao, Xianming Liu, Deming Zhai, Junjun Jiang, Xiangyang Ji
TL;DR
This work addresses two core weaknesses of learning-based mesh denoising: single-modal representations that lose multi-attribute information, and limited global feature aggregation. It introduces the Local Surface Descriptor (LSD), a multimodal representation that encodes local geometry as image-like patches and spatial context as a point cloud, which makes mesh data amenable to Transformer modeling. The SurfaceFormer framework uses a dual-stream design, with a Geometric Encoder and a Spatial Encoder, followed by a Denoising Transformer that aggregates features globally; a Vertex Refinement step then aligns the denoised vertices with the normals. Experiments on synthetic, Kinect, real-scanned, and reconstructed datasets show state-of-the-art results on the objective metrics $E_a$ and $E_v$ as well as in subjective quality, indicating strong generalization and practical applicability across diverse scanning pipelines.
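As a rough illustration of the kind of objective metric referred to above, the sketch below computes a mean angular error between predicted and ground-truth normals. It assumes that $E_a$ follows the convention common in the mesh denoising literature (average angle, in degrees, between corresponding face normals); the summary itself does not define the metric, and the function name `mean_angular_error` is hypothetical.

```python
import math


def mean_angular_error(pred_normals, gt_normals):
    """Mean angle in degrees between corresponding normal vectors.

    Assumption: this mirrors the usual definition of E_a in mesh
    denoising papers; the exact formula used by the authors is not
    given in the text above.
    """
    if len(pred_normals) != len(gt_normals) or not pred_normals:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for p, g in zip(pred_normals, gt_normals):
        dot = sum(a * b for a, b in zip(p, g))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in g))
        # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
        cos_angle = max(-1.0, min(1.0, dot / norm))
        total += math.degrees(math.acos(cos_angle))
    return total / len(pred_normals)


# Two faces: one normal matches exactly, the other is off by a right angle.
pred = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
print(mean_angular_error(pred, gt))
```

For this toy input the per-face angles are 0 and 90 degrees, so the mean should come out at about 45 degrees. Lower values indicate denoised normals that are closer to the ground truth.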
Abstract
Mesh denoising, which aims to remove noise from input meshes while preserving their feature structures, is a practical yet challenging task. Despite the remarkable progress in learning-based mesh denoising in recent years, existing network designs often suffer from two principal drawbacks: a dependence on single-modal geometric representations, which fall short of capturing the multifaceted attributes of meshes, and a lack of effective global feature aggregation, which hinders their ability to understand the overall structure of the mesh. To tackle these issues, we propose SurfaceFormer, a pioneering Transformer-based mesh denoising framework. Our first contribution is a new representation, the Local Surface Descriptor, which is built by establishing polar systems on each mesh face and then sampling points from adjacent surfaces using geodesics. The normals of these points are organized into 2D patches, mimicking images to capture local geometric intricacies, whereas the poles and vertex coordinates are consolidated into a point cloud to embody spatial information. This representation overcomes the irregular and non-Euclidean nature of mesh data and allows smooth integration with Transformer architectures. Next, we propose a dual-stream structure consisting of a Geometric Encoder branch and a Spatial Encoder branch, which jointly encode local geometric details and spatial information to fully exploit multimodal information for mesh denoising. A subsequent Denoising Transformer module receives the multimodal information and achieves efficient global feature aggregation through self-attention operators. Our experimental evaluations demonstrate that this approach outperforms existing state-of-the-art methods in both objective and subjective assessments.
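To make the phrase "global feature aggregation through self-attention operators" concrete, here is a minimal sketch of single-head scaled dot-product self-attention over a list of per-face feature tokens. It is a generic textbook formulation, not the authors' Denoising Transformer: the function names and the projection matrices `w_q`, `w_k`, `w_v` are illustrative assumptions, and a real model would add multiple heads, learned weights, normalization, and feed-forward layers.

```python
import math


def matmul(a, b):
    """Multiply an n x d matrix by a d x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: one feature vector per mesh face (hypothetical encoder output).
    Returns the aggregated features and the attention weights.
    """
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    # Every token scores every other token, which is what makes the
    # aggregation global rather than limited to a local neighborhood.
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v), weights


# Toy input: three face tokens with two features each, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features, attn = self_attention(tokens, identity, identity, identity)
print(features)
print(attn)
```

Each row of `attn` sums to 1, so every output feature is a weighted average over all faces in the input. This all-to-all weighting is the property that lets a Transformer relate distant regions of a mesh in a single layer.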
