$\mathbf{M^3A}$ Policy: Mutable Material Manipulation Augmentation Policy through Photometric Re-rendering
Jiayi Li, Yuxuan Hu, Haoran Geng, Xiangyu Chen, Chuhao Zhou, Ziteng Cui, Jianfei Yang
TL;DR
Mutable Material Manipulation Augmentation (M$^3$A) addresses material generalization in robotic manipulation by photometrically re-rendering a single real-world demonstration into multiple material variants. It uses Grounded-SAM2 for object masks, MiDaS for depth, CLIP+IP-Adapter for material exemplars, and a Stable Diffusion-based inpainting step to produce realistic material appearances while preserving motion trajectories, enabling large-scale multi-material demonstrations without new data collection. The authors introduce the M$^3$ benchmark on RoboVerse to evaluate cross-material generalization and zero-shot material transfer, and report substantial gains on real-world tasks (a 58.03% average improvement in success rate across three tasks) and robust zero-shot performance on unseen materials. This work provides a practical, scalable pathway toward material-agnostic robotic learning by decoupling appearance from manipulation behavior and leveraging diffusion-based imitation learning.
Abstract
Material generalization is essential for real-world robotic manipulation, where robots must interact with objects exhibiting diverse visual and physical properties. This challenge is particularly pronounced for objects made of glass, metal, or other materials whose transparent or reflective surfaces introduce severe out-of-distribution variations. Existing approaches either render materials in simulators and perform sim-to-real transfer, which is hindered by substantial visual domain gaps, or depend on collecting extensive real-world demonstrations, which is costly, time-consuming, and still insufficient to cover the range of materials encountered in practice. To overcome these limitations, we turn to computational photography and introduce Mutable Material Manipulation Augmentation (M$^3$A), a unified framework that leverages the physical characteristics of materials as captured by light transport for photometric re-rendering. The core idea is simple yet powerful: given a single real-world demonstration, we photometrically re-render the scene to generate a diverse set of highly realistic demonstrations with different material properties. This augmentation effectively decouples task-specific manipulation skills from surface appearance, enabling policies to generalize across materials without additional data collection. To systematically evaluate this capability, we construct the first comprehensive multi-material manipulation benchmark spanning both simulation and real-world environments. Extensive experiments show that the M$^3$A policy significantly enhances cross-material generalization, improving the average success rate across three real-world tasks by 58.03\%, and demonstrating robust performance on previously unseen materials.
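The core idea above (re-render appearance, keep the manipulation trajectory) can be illustrated with a minimal sketch of the data flow. This is not the authors' implementation: the `Demo` container, `tint_rerender`, and `augment_materials` are hypothetical names, and the simple color blend is only a stand-in for the diffusion-based photometric re-rendering. The object masks are assumed to be given, for example from a segmentation model.

```python
from dataclasses import dataclass
from typing import List, Sequence

import numpy as np


@dataclass
class Demo:
    """Hypothetical container for one demonstration."""
    frames: np.ndarray   # (T, H, W, 3) floats in [0, 1]
    masks: np.ndarray    # (T, H, W) booleans marking the target object
    actions: np.ndarray  # (T, action_dim) recorded robot actions


def tint_rerender(frame: np.ndarray, mask: np.ndarray, tint: np.ndarray) -> np.ndarray:
    """Placeholder re-renderer: blend a tint into the masked pixels only.

    A real pipeline would replace this with a learned re-rendering model;
    the blend just makes the appearance change visible.
    """
    out = frame.copy()
    out[mask] = 0.5 * frame[mask] + 0.5 * tint
    return out


def augment_materials(demo: Demo, tints: Sequence[Sequence[float]]) -> List[Demo]:
    """Create one appearance variant per tint, reusing the same actions."""
    variants = []
    for tint in tints:
        tint_arr = np.asarray(tint, dtype=demo.frames.dtype)
        frames = np.stack(
            [tint_rerender(f, m, tint_arr) for f, m in zip(demo.frames, demo.masks)]
        )
        # Only pixels change; the action trajectory is copied unchanged.
        variants.append(Demo(frames=frames, masks=demo.masks, actions=demo.actions.copy()))
    return variants
```

The design point is that each variant pairs new observations with the original actions, so a policy trained on all variants sees the same behavior under several appearances.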
