X-Ray: A Sequential 3D Representation For Generation

Tao Hu; Wenhang Ge; Yuyang Zhao; Gim Hee Lee

TL;DR

The paper introduces X-Ray, a sequential 3D representation that encodes an object's full surface geometry—visible and hidden—into a multi-layer, video-like format using ray casting. It then builds a two-stage generative pipeline (X-Ray Diffusion Model + X-Ray Upsampler) that generates low- and high-resolution X-Rays conditioned on a single image and decodes them into 3D meshes via point clouds and Screened Poisson reconstruction. Empirically, X-Ray outperforms rendering-based baselines on single-view reconstruction benchmarks (GSO, OmniObject3D) and surpasses state-of-the-art 3D diffusion methods on ShapeNet Car generation, while substantially reducing data footprint by focusing on surfaces. The approach enables leveraging video diffusion techniques for 3D synthesis, achieving complete object reconstructions, including internal structures, and offering a scalable, generalizable path for 3D generation from images with practical applications in design and visualization.

Abstract

We introduce X-Ray, a novel 3D sequential representation inspired by the penetrability of x-ray scans. X-Ray transforms a 3D object into a series of surface frames at different layers, making it suitable for generating 3D models from images. Our method uses ray casting from the camera center to capture geometric and textural details, including depth, normal, and color, across all intersected surfaces. This process efficiently condenses the whole 3D object into a multi-frame video format, motivating the use of a network architecture similar to those in video diffusion models. This design ensures an efficient 3D representation by focusing solely on surface information. We also propose a two-stage pipeline that generates 3D objects with an X-Ray Diffusion Model and an Upsampler. We demonstrate the practicality and adaptability of our X-Ray representation by synthesizing the complete visible and hidden surfaces of a 3D object from a single input image. Experimental results show that our representation achieves state-of-the-art accuracy in 3D generation, paving the way for new 3D representation research and practical applications.
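The ray-casting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the scene (two concentric spheres standing in for a hollow object), the camera placement, the function names `ray_sphere_hits` and `encode_xray`, and the choice to store depth as distance along the ray are all assumptions made for illustration, and color is omitted. Each pixel casts one ray from the camera center, and the k-th surface it crosses is written into the k-th layer of the hit, depth, and normal grids.

```python
import math

def ray_sphere_hits(origin, direction, radius):
    """Distances t > 0 at which origin + t * direction crosses a sphere
    centered at (0, 0, 0). `direction` must be unit length."""
    b = sum(o * d for o, d in zip(origin, direction))
    c = sum(o * o for o in origin) - radius * radius
    disc = b * b - c
    if disc <= 0.0:  # miss (tangent grazes are ignored)
        return []
    root = math.sqrt(disc)
    return [t for t in (-b - root, -b + root) if t > 0.0]

def encode_xray(radii, cam_z=-3.0, size=9, half_extent=1.0, num_layers=4):
    """Hypothetical encoder: per-layer hit, depth, and normal grids."""
    origin = (0.0, 0.0, cam_z)
    step = 2.0 * half_extent / size
    # One size x size grid per layer and attribute; misses stay at zero.
    hit = [[[0] * size for _ in range(size)] for _ in range(num_layers)]
    depth = [[[0.0] * size for _ in range(size)] for _ in range(num_layers)]
    normal = [[[(0.0, 0.0, 0.0)] * size for _ in range(size)]
              for _ in range(num_layers)]
    for v in range(size):
        for u in range(size):
            # Ray from the camera center through the pixel center on z = 0.
            target = (-half_extent + (u + 0.5) * step,
                      -half_extent + (v + 0.5) * step, 0.0)
            diff = [t - o for t, o in zip(target, origin)]
            norm = math.sqrt(sum(x * x for x in diff))
            direction = [x / norm for x in diff]
            # All surface crossings along this ray, nearest first.
            crossings = sorted((t, r) for r in radii
                               for t in ray_sphere_hits(origin, direction, r))
            for layer, (t, r) in enumerate(crossings[:num_layers]):
                point = [o + t * d for o, d in zip(origin, direction)]
                hit[layer][v][u] = 1
                depth[layer][v][u] = t
                # Unit normal pointing away from the sphere center.
                normal[layer][v][u] = tuple(p / r for p in point)
    return hit, depth, normal

hit, depth, normal = encode_xray(radii=(1.0, 0.5))
center = 4  # middle pixel of the 9 x 9 grid
print([round(depth[l][center][center], 3) for l in range(4)])
# [2.0, 2.5, 3.5, 4.0]
```

The central ray crosses four surfaces (outer front, inner front, inner back, outer back), so all four layers are filled there, while a corner ray misses the object and leaves every layer empty. A rendering-based representation would keep only the first of those four crossings.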

Paper Structure (23 sections, 4 equations, 8 figures, 4 tables)

This paper contains 23 sections, 4 equations, 8 figures, 4 tables.

Introduction
Related Work
Representation for 3D Models
Generative Models for 3D Generation
Our X-Ray Representation
Encoding
Decoding
X-Ray for 3D Generation
X-Ray Diffusion Model
X-Ray Upsampler
Experiments
Dataset and Implementation
Analysis of Encoding-Decoding Intrinsic Error
Quantitative Comparison
Qualitative Comparison
...and 8 more sections

Figures (8)

Figure 1: Comparison between rendering-based 3D generation (TripoSR, LRM) and our proposed X-Ray generation. The competitors focus on the visible outer surface in multiple camera views. In contrast, our model can sense both the visible and hidden surfaces in a single camera view and generate the outer and inner surfaces of objects. An example of the missing mesh interior from rendering-based 3D synthesis vs. the complete mesh interior from our X-Ray generator is shown in the bottom row.
Figure 2: Samples of our X-Ray 3D sequential representation. Given a viewpoint, we capture the 3D attributes of multi-layer surface frames, including hit $\mathbf{H}$, depth $\mathbf{D}$, normal $\mathbf{N}$, and color $\mathbf{C}$, in a video format. Note that the number of frames in an X-Ray varies depending on the complexity of the 3D object. The dotted yellow lines indicate the ray or sequence direction.
Figure 3: Overview of our proposed generative pipeline for the X-Ray 3D representation. There are three main components: (a) The X-Ray diffusion model, which generates a low-resolution X-Ray from an image input. (b) The upsampler, which upsamples the low-resolution X-Ray to $4\times$ higher resolution. (c) The mesh decoding model, which decodes the high-resolution X-Ray into a point cloud with color and normal, and then converts it into the final generated mesh.
Figure 4: The encoding-decoding intrinsic error for different frame resolutions and numbers of layers.
Figure 5: Quantitative Comparison in Image-to-3D Generation.
...and 3 more figures
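
The point-cloud half of the mesh decoding step in Figure 3(c) can be sketched as follows. This is an illustrative assumption rather than the paper's code: the function name `decode_xray`, the camera placement, and the convention that depth is measured along the ray are hypothetical, and color is omitted. Every pixel with a hit in a given layer is pushed back along its ray by the stored depth, yielding an oriented point. The later conversion of those points to a mesh (Screened Poisson reconstruction in the paper) is not shown.

```python
import math

def decode_xray(hit, depth, normal, cam_z=-3.0, half_extent=1.0):
    """Hypothetical decoder: per-layer grids -> list of (point, normal)."""
    origin = (0.0, 0.0, cam_z)
    size = len(hit[0])
    step = 2.0 * half_extent / size
    points = []
    for layer in range(len(hit)):
        for v in range(size):
            for u in range(size):
                if not hit[layer][v][u]:
                    continue  # this ray crossed fewer than layer + 1 surfaces
                # Rebuild the ray through the pixel center on the z = 0 plane.
                target = (-half_extent + (u + 0.5) * step,
                          -half_extent + (v + 0.5) * step, 0.0)
                diff = [t - o for t, o in zip(target, origin)]
                norm = math.sqrt(sum(x * x for x in diff))
                t = depth[layer][v][u]
                # Back-project: camera center plus depth times unit direction.
                point = tuple(o + t * x / norm for o, x in zip(origin, diff))
                points.append((point, normal[layer][v][u]))
    return points

# Hand-made 2-layer, 1 x 1 X-Ray: a single ray along +z that crosses
# surfaces at distances 2 and 4 from the camera center.
hit = [[[1]], [[1]]]
depth = [[[2.0]], [[4.0]]]
normal = [[[(0.0, 0.0, -1.0)]], [[(0.0, 0.0, 1.0)]]]
cloud = decode_xray(hit, depth, normal)
```

Here `cloud` holds two oriented points, one on the front surface and one on the hidden back surface, which is the kind of input a Poisson-style surface reconstruction expects.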
