One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues
Sindhuja Penchala, Gavin Money, Gabriel Marques, Samuel Wood, Jessica Kirschman, Travis Atkison, Shahram Rahimi, Noorbakhsh Amiri Golilarz
TL;DR
The paper tackles surface material reconstruction and classification from minimal visual cues by proposing SMARC, a mask-aware partial-convolution U-Net with a multi-scale classification head. It jointly performs inpainting and material recognition with only 10% of the input visible, achieving a PSNR of $17.55$ dB and an accuracy of $85.10\%$ on the Touch and Go dataset and outperforming five strong baselines (convolutional autoencoders, ViT, MAE, Swin Transformer, DETR). SMARC relies on mask propagation and dilated partial convolutions to maintain spatial fidelity and semantic awareness under highly sparse observations, while delivering real-time inference with a model of approximately $19.1$M parameters. The work demonstrates the practical viability of minimal-vision surface understanding for robotic perception, offering a foundation for robust shape-texture inference and material categorization in constrained environments.
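To make the mask-propagation idea concrete, the following is a minimal sketch of a single-channel partial convolution in its commonly used formulation: only visible pixels contribute to each output, the result is rescaled by the ratio of window size to visible-pixel count, and the mask grows wherever at least one visible pixel was in the window. The function name `partial_conv2d`, the pure-Python list representation, and the zero-padded borders are illustrative assumptions; SMARC's actual layers (channel counts, dilation rates, learned weights) are not specified in this summary.

```python
def partial_conv2d(image, mask, kernel, bias=0.0):
    """Single-channel partial convolution with 'same' zero padding.

    image, mask: H x W lists of lists; mask is 1 for visible pixels, 0 for holes.
    kernel: k x k list of lists with odd k (hypothetical fixed weights here).
    Returns (output, updated_mask).
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    window_size = k * k
    out = [[0.0] * w for _ in range(h)]
    new_mask = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            valid = 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    y, x = i + di, j + dj
                    # Out-of-bounds and masked-out pixels are both ignored.
                    if 0 <= y < h and 0 <= x < w and mask[y][x]:
                        acc += kernel[di + r][dj + r] * image[y][x]
                        valid += 1
            if valid > 0:
                # Rescale so sparse windows are not biased toward zero.
                out[i][j] = acc * (window_size / valid) + bias
                # Mask propagation: this location now counts as visible.
                new_mask[i][j] = 1
    return out, new_mask
```

Applying the layer repeatedly lets the visible region expand by the kernel radius at each step, which is why stacked partial convolutions can eventually fill an image from a small visible patch.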
Abstract
Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation, and material perception. However, most existing methods rely on dense or full-scene observations, limiting their effectiveness in constrained or partial-view environments. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. Given only a single contiguous patch covering 10% of the image, SMARC reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a partial-convolution U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compare SMARC against five models, namely convolutional autoencoders [17], Vision Transformer (ViT) [13], Masked Autoencoder (MAE) [5], Swin Transformer [9], and DETR [2], on the Touch and Go dataset [16] of real-world surface textures. SMARC achieves state-of-the-art results with a PSNR of 17.55 dB and a material classification accuracy of 85.10%. Our findings highlight the advantages of partial convolution for spatial reasoning under missing data and establish a strong foundation for minimal-vision surface understanding.
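As an illustration of the evaluation setting, the sketch below builds a binary mask that leaves one contiguous patch of roughly 10% of the pixels visible and computes PSNR from its standard definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The square patch shape, the random placement, and the helper names `square_patch_mask` and `psnr` are assumptions made for this example; the abstract states only that the visible region is a single contiguous patch covering 10% of the image.

```python
import math
import random


def square_patch_mask(height, width, visible_fraction=0.10, rng=None):
    """Return an H x W mask (1 = visible, 0 = hidden) with one square patch.

    The square shape and uniform random position are assumptions of this
    sketch, not details taken from the paper.
    """
    rng = rng or random.Random()
    side = round(math.sqrt(visible_fraction * height * width))
    side = max(1, min(side, height, width))
    top = rng.randint(0, height - side)
    left = rng.randint(0, width - side)
    mask = [[0] * width for _ in range(height)]
    for y in range(top, top + side):
        for x in range(left, left + side):
            mask[y][x] = 1
    return mask


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D arrays."""
    squared_error = 0.0
    count = 0
    for ref_row, est_row in zip(reference, estimate):
        for ref, est in zip(ref_row, est_row):
            squared_error += (ref - est) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For a sense of scale, if PSNR is computed on intensities scaled to [0, 1], a value of 17.55 dB corresponds to a root-mean-square error of about 0.13, which is plausible for a setting where 90% of the image must be inferred.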
