Cue3D: Quantifying the Role of Image Cues in Single-Image 3D Generation
Xiang Li, Zirui Wang, Zixuan Huang, James M. Rehg
TL;DR
Cue3D provides a model-agnostic framework to dissect how monocular image cues drive single-image 3D generation. By perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity across seven diverse methods and two standard datasets, the study reveals that shape meaningfulness and shading are key drivers of generalization and geometry quality, while texture is less critical. The results show that native 3D generative models consistently outperform others in 3D geometry and symmetry, that models are notably sensitive to silhouettes, and that dependence on other cues varies across model families. These insights offer pathways toward more transparent, robust, and controllable single-image 3D generation and highlight the value of cue-aware design and evaluation. The work also provides a comprehensive benchmark and analyses that can guide future research in interpretable and reliable image-to-3D generation pipelines.
Abstract
Humans and traditional computer vision methods rely on a diverse set of monocular cues, such as shading, texture, and silhouette, to infer 3D structure from a single image. While recent deep generative models have dramatically advanced single-image 3D generation, it remains unclear which image cues these methods actually exploit. We introduce Cue3D, the first comprehensive, model-agnostic framework for quantifying the influence of individual image cues in single-image 3D generation. Our unified benchmark evaluates seven state-of-the-art methods, spanning regression-based, multi-view, and native 3D generative paradigms. By systematically perturbing cues such as shading, texture, silhouette, perspective, edges, and local continuity, we measure their impact on 3D output quality. Our analysis reveals that shape meaningfulness, not texture, dictates generalization. Geometric cues, particularly shading, are crucial for 3D generation. We further identify over-reliance on provided silhouettes and diverse sensitivities to cues such as perspective and local continuity across model families. By dissecting these dependencies, Cue3D advances our understanding of how modern 3D networks leverage classical vision cues, and offers directions for developing more transparent, robust, and controllable single-image 3D generation models.
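To make the perturb-and-measure idea concrete, the following is a minimal sketch of one possible cue perturbation and one possible sensitivity score. It is not the authors' implementation: patch shuffling as a way to disrupt local continuity, the function names `shuffle_patches` and `relative_drop`, and the use of a generic higher-is-better quality score are all assumptions made for illustration.

```python
import numpy as np


def shuffle_patches(image, patch, seed=0):
    """Disrupt local continuity by randomly permuting square patches.

    Hypothetical perturbation (an assumption, not the paper's procedure).
    `image` is an H x W or H x W x C array; H and W must be multiples of `patch`.
    """
    h, w = image.shape[:2]
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch")
    gh, gw = h // patch, w // patch
    tail = image.shape[2:]  # channel dimension, if any

    # Cut the image into a (gh * gw) stack of patch x patch tiles.
    tiles = image.reshape(gh, patch, gw, patch, *tail).swapaxes(1, 2)
    tiles = tiles.reshape(gh * gw, patch, patch, *tail)

    # Permute the tiles with a seeded generator for reproducibility.
    order = np.random.default_rng(seed).permutation(gh * gw)
    tiles = tiles[order]

    # Reassemble the tiles into an image of the original shape.
    out = tiles.reshape(gh, gw, patch, patch, *tail).swapaxes(1, 2)
    return out.reshape(h, w, *tail)


def relative_drop(baseline_score, perturbed_score):
    """Fractional quality loss for a higher-is-better score (assumed metric)."""
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return (baseline_score - perturbed_score) / baseline_score


# Toy usage: perturb a synthetic image; a real study would feed both versions
# to an image-to-3D model and compare the quality of the two reconstructions.
img = np.arange(8 * 8 * 3, dtype=np.uint8).reshape(8, 8, 3)
perturbed = shuffle_patches(img, patch=4, seed=0)
```

A larger relative drop under a given perturbation would suggest stronger reliance on the corresponding cue. The patch size controls how much local structure survives, so sweeping it is one plausible way to probe sensitivity at different scales.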
