An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion

Xingguang Yan, Han-Hung Lee, Ziyu Wan, Angel X. Chang

TL;DR

This work introduces Object Images (omages), a 12-channel, 64×64 representation that encodes geometry, UV-induced patch structure, and PBR materials to enable diffusion-based 3D asset generation. Meshes are rasterized into regular 2D images through UV-atlas repacking, which preserves topology and semantic patch information while keeping the data amenable to image-model architectures, here a Diffusion Transformer trained on ABO data. The approach achieves geometry quality (p-FID, i.e., point cloud FID) comparable to recent 3D generative models and naturally supports material generation. It also allows downsampling that preserves patch boundaries. Limitations include the lack of a watertightness guarantee, dependence on good UV atlases, and the current 64×64 resolution; future work targets higher resolutions and broader topological guarantees.

Abstract

We introduce a new approach for generating realistic 3D models with UV maps through a representation termed "Object Images." This approach encapsulates surface geometry, appearance, and patch structures within a 64x64 pixel image, effectively converting complex 3D shapes into a more manageable 2D format. By doing so, we address the challenges of both geometric and semantic irregularity inherent in polygonal meshes. This method allows us to use image generation models, such as Diffusion Transformers, directly for 3D shape generation. Evaluated on the ABO dataset, our generated shapes with patch structures achieve point cloud FID comparable to recent 3D generative models, while naturally supporting PBR material generation.
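
To make the representation concrete, here is a minimal NumPy sketch of how an Object Image might be stored and read back as a point cloud. The source states only that the image is 64×64 with 12 channels covering geometry, patch structure, and PBR materials; the channel ordering below (`XYZ`, `OCC`, `MATERIAL`) and the helper names `omage_to_points` and `make_toy_omage` are assumptions made for illustration, not the authors' layout or code.

```python
import numpy as np

# Hypothetical channel layout (an assumption for illustration; the source
# only states that there are 12 channels in total).
XYZ = slice(0, 3)        # surface position stored at each pixel
OCC = 3                  # occupancy: 1 inside a patch, 0 in empty atlas space
MATERIAL = slice(4, 12)  # remaining 8 channels, assumed to hold PBR maps


def omage_to_points(omage, threshold=0.5):
    """Return the (N, 3) positions stored at occupied pixels."""
    if omage.shape != (64, 64, 12):
        raise ValueError(f"expected shape (64, 64, 12), got {omage.shape}")
    mask = omage[..., OCC] > threshold
    return omage[..., XYZ][mask]


def make_toy_omage():
    """Build a toy omage holding a single 16x16 planar patch."""
    omage = np.zeros((64, 64, 12), dtype=np.float32)
    u, v = np.meshgrid(
        np.linspace(-1.0, 1.0, 16), np.linspace(-1.0, 1.0, 16), indexing="ij"
    )
    omage[8:24, 8:24, 0] = u        # x varies along the first image axis
    omage[8:24, 8:24, 1] = v        # y varies along the second image axis
    omage[8:24, 8:24, OCC] = 1.0    # mark the patch as occupied
    return omage


points = omage_to_points(make_toy_omage())
```

Under this layout, every occupied pixel contributes one surface sample, so a 64×64 image holds at most 4096 points; pixels outside any patch are simply skipped.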

Paper Structure (23 sections, 2 equations, 10 figures, 1 table)

This paper contains 23 sections, 2 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Visualization of geometry generation (top row) using diffusion for Object Images followed by material generation (right). The spatial coordinates (xyz) are visualized as rgb colors (see inset Object Images). The colors of the denoising mesh highlight different connected components. After generating the geometry, our model can generate PBR materials given the geometry as a condition. Other examples of generated shapes are shown in the 2nd row.
  • Figure 2: Comparison of different representations used for generation. Simplified meshes (left) often introduce topological errors and degenerate parts. Volumetric representations (middle) tend to merge touching parts, struggle to model thin surfaces, and cannot handle open surfaces. In contrast, our Object Images (right) effectively preserve the topology and structure of the original mesh.
  • Figure 3: Method overview. Left: We assume the mesh $\mathcal{M}$ has a patch decomposition $\{S_i\}$ and a single-valued UV map $f_i$ that flattens each patch $S_i$ into the 2D UV domain. Together with the material maps, Object Images can represent high-quality, photorealistic objects. Right: We train the image diffusion model with a Diffusion Transformer. The noised input Object Image, omg, is first flattened into a sequence and then passed to the transformer, which predicts the clean $\text{omg}_0$.
  • Figure 4: Directly downscaling an omage from high resolution (a) to lower resolution (b) usually leaves significant gaps between patches. Snapping the boundary vertices of the high-resolution omage (f) to the lower resolution via sparse pooling (e)(g) significantly reduces these gaps (c)(d).
  • Figure 5: Examples of label-conditioned Omage-64 generation results. The left side displays results for 'ottoman', 'bed', 'exercise equipment', 'painting', 'lamp', 'vanity', 'plant pot', 'chair', 'pillow', and 'lamp'. Even at this resolution, thin structures are successfully generated. On the right, a scene with three objects generated by our method is shown, highlighting our capability in material generation.
  • ...and 5 more figures
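
Two operations mentioned in the captions for Figures 3 and 4 can be sketched in NumPy: flattening an Object Image into a token sequence, and downscaling positions without mixing in empty atlas pixels. Both functions are illustrative assumptions. The `patch` size, the function names, and the block-averaging rule are not taken from the paper, and `masked_downscale` is a simplified occupancy-aware average rather than the boundary-snapping sparse pooling that Figure 4 describes.

```python
import numpy as np


def flatten_to_tokens(omage, patch=1):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles and
    return a sequence of shape (H * W / patch**2, patch * patch * C)."""
    h, w, c = omage.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    tiles = omage.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return tiles.reshape(-1, patch * patch * c)


def masked_downscale(xyz, occ, factor):
    """Average positions over occupied pixels only in each factor x factor
    block. Returns (xyz_low, occ_low)."""
    h, w, _ = xyz.shape
    if h % factor or w % factor:
        raise ValueError("image size must be divisible by the factor")
    hb, wb = h // factor, w // factor
    m = (occ > 0.5).astype(xyz.dtype).reshape(hb, factor, wb, factor)
    p = xyz.reshape(hb, factor, wb, factor, 3)
    count = m.sum(axis=(1, 3))                    # occupied pixels per block
    total = (p * m[..., None]).sum(axis=(1, 3))   # sum of occupied positions
    xyz_low = total / np.maximum(count, 1)[..., None]
    occ_low = (count > 0).astype(xyz.dtype)       # block occupied if any pixel is
    return xyz_low, occ_low


# A 64x64x12 image becomes 4096 tokens of width 12 when patch=1.
tokens = flatten_to_tokens(np.zeros((64, 64, 12), dtype=np.float32), patch=1)
```

The masked average illustrates why naive downscaling is problematic: a plain mean over a block that straddles a patch boundary blends real positions with whatever values sit in unoccupied pixels, which pulls boundary samples away from the surface.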