One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues

Sindhuja Penchala, Gavin Money, Gabriel Marques, Samuel Wood, Jessica Kirschman, Travis Atkison, Shahram Rahimi, Noorbakhsh Amiri Golilarz

TL;DR

The paper tackles surface material reconstruction and classification from minimal visual cues by proposing SMARC, a mask-aware, partial-convolution U-Net with a multi-scale classification head. The model jointly performs inpainting and material recognition with only 10% of the input visible, achieving a PSNR of $17.55$ dB and an accuracy of $85.10\%$ on the Touch and Go dataset and outperforming five baselines (convolutional autoencoders, ViT, MAE, Swin Transformer, DETR). SMARC relies on mask propagation and dilated partial convolutions to maintain spatial fidelity and semantic awareness under highly sparse observations, and it is described as running in real time with a model of approximately $19.1$M parameters. The work demonstrates the practical viability of minimal-vision surface understanding for robotic perception, offering a foundation for robust shape-texture inference and material categorization in constrained environments.
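As a rough illustration of the mask propagation mentioned above, the following pure-Python sketch implements a single-channel partial convolution in its commonly used formulation: the response is computed over visible pixels only, rescaled by the ratio of window size to visible count, and the output position is marked valid if any pixel under the kernel was visible. The function name `partial_conv2d`, the zero-padding behavior, and the single-channel, undilated setting are assumptions made for illustration; SMARC's actual layers may differ.

```python
def partial_conv2d(image, mask, kernel, bias=0.0):
    """Single-channel partial convolution with 'same' output size.

    image, mask: H x W lists of rows; mask is 1 for visible, 0 for missing.
    kernel: k x k list of rows with odd k (hypothetical example weights).
    Positions outside the image are treated as missing.
    Returns (output, updated_mask).
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    window_size = k * k
    out = [[0.0] * w for _ in range(h)]
    new_mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            valid = 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w and mask[yy][xx]:
                        acc += kernel[dy + r][dx + r] * image[yy][xx]
                        valid += 1
            if valid > 0:
                # Rescale so the response does not shrink when few pixels are visible.
                out[y][x] = acc * (window_size / valid) + bias
                new_mask[y][x] = 1  # mask propagation: this position is now valid
    return out, new_mask


# Example: a 5x5 constant image where only the center pixel is visible.
image = [[2.0] * 5 for _ in range(5)]
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1
mean_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
filled, grown_mask = partial_conv2d(image, mask, mean_kernel)
```

With a 3x3 kernel, each application grows the valid region by one pixel in every direction, which is how a small visible patch can eventually inform the whole image after enough layers.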

Abstract

Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation, and material perception. However, most existing methods rely on dense or full-scene observations, limiting their effectiveness in constrained or partial-view environments. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. Given only a single contiguous patch covering 10% of the image, SMARC reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a Partial Convolutional U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compare SMARC against five models, namely convolutional autoencoders [17], Vision Transformer (ViT) [13], Masked Autoencoder (MAE) [5], Swin Transformer [9], and DETR [2], using the Touch and Go dataset [16] of real-world surface textures. SMARC achieves state-of-the-art results with a PSNR of 17.55 dB and a material classification accuracy of 85.10%. Our findings highlight the advantages of partial convolution in spatial reasoning under missing data and establish a strong foundation for minimal-vision surface understanding.
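To make the input setting and the reconstruction metric concrete, here is a minimal sketch of a single contiguous visible patch covering about 10% of the image, together with the standard PSNR definition, $10 \log_{10}(\mathrm{peak}^2 / \mathrm{MSE})$. The helper names `contiguous_patch_mask` and `psnr`, the choice of a patch with the image's aspect ratio, and the peak value of 1.0 are assumptions for illustration, not details taken from the paper. Under that definition and a peak of 1, a PSNR of 17.55 dB would correspond to a mean squared error of roughly 0.0176.

```python
import math


def contiguous_patch_mask(height, width, top, left, fraction=0.10):
    """Binary mask (1 = visible) with one rectangular patch of ~fraction area.

    The patch keeps the image's aspect ratio (an assumption) and is shifted
    if necessary so that it lies fully inside the image.
    """
    patch_h = max(1, round(height * math.sqrt(fraction)))
    patch_w = max(1, round(width * math.sqrt(fraction)))
    top = min(max(top, 0), height - patch_h)
    left = min(max(left, 0), width - patch_w)
    mask = [[0] * width for _ in range(height)]
    for y in range(top, top + patch_h):
        for x in range(left, left + patch_w):
            mask[y][x] = 1
    return mask


def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB between two H x W images."""
    total = 0.0
    count = 0
    for ref_row, rec_row in zip(reference, reconstruction):
        for a, b in zip(ref_row, rec_row):
            total += (a - b) ** 2
            count += 1
    mse = total / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Because the patch dimensions are rounded to whole pixels, the visible area is only approximately 10%; on a 100x100 image, for example, the patch is 32x32 pixels.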

Paper Structure

This paper contains 19 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of SMARC. The network follows an encoder–bottleneck–decoder design with partial convolutions and explicit mask propagation. Skip connections fuse encoder features into the decoder for reconstruction, while a multi-scale head pools features from $s_3^y$, $s_4^y$, and the bottleneck $b$ for surface classification. The model restores occluded regions in the RGB image and simultaneously predicts the material class.
  • Figure 2: Confusion matrices of all six models evaluated on the Touch and Go dataset using 10% input crops. SMARC (f) shows strong diagonal dominance, reflecting robust class-wise performance.
  • Figure 3: ROC curves for all six models evaluated on the Touch and Go dataset. SMARC exhibits consistently higher AUC values across classes, highlighting its robust discrimination capabilities compared to other models.
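The multi-scale classification head in Figure 1 can be pictured with the sketch below, which pools each of three feature maps to one value per channel, concatenates the results, and applies a linear layer followed by a softmax. The use of global average pooling, the single linear layer, and all names, shapes, and weights here are assumptions for illustration; the caption only states that features from $s_3^y$, $s_4^y$, and the bottleneck $b$ are pooled for classification.

```python
import math


def global_average_pool(feature_map):
    """feature_map: list of channels, each an H x W list of rows.

    Returns one mean value per channel.
    """
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def multi_scale_logits(feature_maps, weights, biases):
    """Pool every feature map, concatenate, and apply a linear classifier.

    weights: one row per class, with one entry per pooled channel.
    """
    pooled = []
    for feature_map in feature_maps:
        pooled.extend(global_average_pool(feature_map))
    return [
        sum(w * v for w, v in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy stand-ins for s3, s4, and the bottleneck b at decreasing resolution.
s3 = [[[1.0] * 4 for _ in range(4)]]   # 1 channel, 4x4
s4 = [[[2.0] * 2 for _ in range(2)]]   # 1 channel, 2x2
b = [[[3.0]], [[4.0]]]                 # 2 channels, 1x1
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # 2 hypothetical classes
probs = softmax(multi_scale_logits([s3, s4, b], weights, [0.0, 0.0]))
```

Pooling before concatenation lets feature maps of different spatial sizes contribute to one fixed-length vector, so the classifier does not depend on the resolution of any single scale.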