Harnessing the Power of Foundation Models for Accurate Material Classification

Qingran Lin, Fengwei Yang, Chaolun Zhu

Abstract

Material classification has emerged as a critical task in computer vision and graphics, supporting the assignment of accurate material properties in a wide range of digital and real-world applications. While traditionally framed as an image classification task, this domain faces significant challenges due to the scarcity of annotated data, which limits the accuracy and generalizability of trained models. Recent advances in vision-language foundation models (VLMs) offer promising avenues to address these issues, yet existing solutions leveraging these models still yield unsatisfactory results in material recognition tasks. In this work, we propose a novel framework that effectively harnesses foundation models to overcome data limitations and enhance classification accuracy. Our method integrates two key innovations: (a) a robust image generation and auto-labeling pipeline that creates a diverse and high-quality training dataset with material-centric images, and automatically assigns labels by fusing object semantics and material attributes in text prompts; (b) a prior incorporation strategy to distill information from VLMs, combined with a joint fine-tuning method that optimizes a pre-trained vision foundation model alongside VLM-derived priors, preserving broad generalizability while adapting to material-specific features. Extensive experiments demonstrate significant improvements on multiple datasets. We show that our synthetic dataset effectively captures the characteristics of real-world materials, and that the integration of priors from vision-language models significantly enhances the final performance. The source code and dataset will be released.

Paper Structure (35 sections, 10 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: With careful knowledge distillation from an image generation network, we build a material dataset of generated images and propose a method for accurate material classification.
  • Figure 2: Dataset generation workflow. Our pipeline synthesizes labeled material images through: (a) Hierarchical prompt engineering with LLM-guided plausibility filtering, (b) Diffusion-based image generation with model selection, (c) Semantic mask extraction, and (d) Region-aware material label assignment.
  • Figure 3: Generated material images using diverse prompts across 21 material categories, with one image per category.
  • Figure 4: Dual-stream architecture. (1) Vision stream: DINOv2 extracts patch features from the masked region, which are aggregated via max-pooling. (2) Language stream: GPT-4V generates material descriptors, which are encoded by CLIP. The fused features are classified by an MLP. (a) Grounded SAM semantic mask extraction; (b) feature aggregation (max-pooling + flatten).
  • Figure 5: PCA overlays of max-pooled DINOv2 patch features from 10 material classes across FMD, DMS, and Ours datasets. Axes are fixed to $[-40,40]$ on both dimensions. Each dataset is shown with scatter points and its covariance ellipse (two standard deviations), highlighting intra-class variation.
  • ...and 5 more figures
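
The dual-stream design described in the Figure 4 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: random arrays stand in for DINOv2 patch features and the CLIP text embedding, the feature dimensions (768 and 512) are assumed, fusion by concatenating L2-normalised vectors is an assumption (the caption only says the features are "fused"), and the function names `masked_max_pool`, `fuse`, and `mlp_logits` are hypothetical. The 21 output classes follow the category count given in the Figure 3 caption.

```python
import numpy as np


def masked_max_pool(patch_feats: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Max-pool patch features over the patches selected by a boolean mask.

    patch_feats: (num_patches, dim) array, a stand-in for DINOv2 patch features.
    mask:        (num_patches,) boolean array, a stand-in for the semantic mask.
    """
    if not mask.any():
        raise ValueError("mask selects no patches")
    return patch_feats[mask].max(axis=0)


def fuse(vision_vec: np.ndarray, text_vec: np.ndarray) -> np.ndarray:
    """Assumed fusion: concatenate the L2-normalised vision and text vectors."""
    v = vision_vec / (np.linalg.norm(vision_vec) + 1e-8)
    t = text_vec / (np.linalg.norm(text_vec) + 1e-8)
    return np.concatenate([v, t])


def mlp_logits(x, w1, b1, w2, b2):
    """Two-layer MLP with a ReLU hidden layer; returns unnormalised class scores."""
    hidden = np.maximum(x @ w1 + b1, 0.0)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, apart from the 21 material categories).
num_patches, vis_dim, txt_dim, hidden_dim, num_classes = 256, 768, 512, 128, 21

patch_feats = rng.normal(size=(num_patches, vis_dim))  # placeholder vision features
mask = rng.random(num_patches) > 0.5                   # placeholder region mask
text_vec = rng.normal(size=txt_dim)                    # placeholder text embedding

fused = fuse(masked_max_pool(patch_feats, mask), text_vec)

# Randomly initialised weights; a real classifier would be trained.
w1 = rng.normal(scale=0.02, size=(vis_dim + txt_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(scale=0.02, size=(hidden_dim, num_classes))
b2 = np.zeros(num_classes)

logits = mlp_logits(fused, w1, b1, w2, b2)
predicted_class = int(np.argmax(logits))
```

Restricting the max-pool to masked patches keeps background regions from contributing to the pooled vector, which is presumably why the caption ties the vision stream to the masked region.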