LiTo: Surface Light Field Tokenization

Jen-Hao Rick Chang, Xiaoming Zhao, Dorian Chan, Oncel Tuzel

TL;DR

LiTo is a 3D latent representation that jointly models object geometry and view-dependent appearance within a unified 3D latent space, enabling the generation of 3D objects whose appearance is consistent with the lighting and materials in the input.

Abstract

We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages the fact that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.
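
To make the statement that RGB-depth images provide samples of a surface light field concrete, here is a minimal sketch, not the authors' implementation. It assumes a pinhole camera with z-depth along the camera's +z axis, integer pixel coordinates, and a 3x4 camera-to-world pose. Each valid pixel becomes one sample made of a 3D surface point, the unit direction from that point toward the camera, and the observed color. The function names `unproject_rgbd` and `random_subsample` and all parameters are hypothetical.

```python
import math
import random


def unproject_rgbd(rgb, depth, fx, fy, cx, cy, cam_to_world):
    """Turn one RGB-D image into surface light field samples.

    rgb: H x W nested list of (r, g, b) tuples.
    depth: H x W nested list of z-depths; values <= 0 mark background.
    cam_to_world: 3 rows of 4 numbers, [R | t] (assumed convention).
    Returns a list of (position, view_direction, color) tuples.
    """
    cam_center = [row[3] for row in cam_to_world]
    samples = []
    for v, depth_row in enumerate(depth):
        for u, z in enumerate(depth_row):
            if z <= 0:
                continue  # no surface observed at this pixel
            # Pinhole back-projection into camera coordinates.
            p_cam = ((u - cx) / fx * z, (v - cy) / fy * z, z)
            # Rigid transform into world coordinates.
            p_world = tuple(
                sum(row[k] * p_cam[k] for k in range(3)) + row[3]
                for row in cam_to_world
            )
            # Unit direction from the surface point toward the camera.
            d = [c - p for c, p in zip(cam_center, p_world)]
            norm = math.sqrt(sum(x * x for x in d))
            view_dir = tuple(x / norm for x in d)
            samples.append((p_world, view_dir, tuple(rgb[v][u])))
    return samples


def random_subsample(samples, k, seed=0):
    """Pick at most k samples uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(samples, min(k, len(samples)))


# Toy 2x2 image seen by a camera at the world origin (identity pose).
identity_pose = [[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]]
rgb = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
       [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]]
depth = [[2.0, 2.0],
         [2.0, 0.0]]  # bottom-right pixel is background

samples = unproject_rgbd(rgb, depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5,
                         cam_to_world=identity_pose)
subset = random_subsample(samples, k=2)
```

Pooling such samples over many views and passing a random subset to an encoder corresponds to the "random subsamples of this surface light field" described in the abstract. How the encoder consumes them is not specified here and is left out of the sketch.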

Paper Structure (35 sections, 10 equations, 15 figures, 11 tables)

Figures (15)

  • Figure 1: LiTo tokenizes surface light fields into a latent representation. It models 3D geometry and view-dependent appearance such as specular reflection. The figure shows reconstructions (first three columns) and single-image-to-3D results (last two columns). Mesh credit: asset_spartan_helmet, asset_scooter, asset_bot, asset_tractor, asset_horse. See more at https://apple.github.io/ml-lito/.
  • Figure 2: Overview of the 3D latent representation. Given samples of the surface light field of the scene, we learn a latent representation that reconstructs the full surface light field. The encoder (pink block) condenses the input information into the latent representation. We jointly supervise the latent representation to contain full 3D geometry and view-dependent radiance information beyond the input samples. In the architecture, we design a localized attention pattern to improve efficiency and support 1 million input tokens.
  • Figure 3: 3D patchification
  • Figure 4: Reconstruction results under various lighting conditions. Boxes on the ground truth highlight specular and Fresnel reflections. See the Toys4K reconstruction evaluation table (tab:eval recon toys4k short) for quantitative results. Mesh credit: delicious_applegrinderknob.
  • Figure 5: Single-image-to-3D results. The input image is shown at the center of each set with a black border. The rendering at the input view is shown next to the input image. See the Toys4K generation evaluation table (tab:eval gen toys4k short) for quantitative results. Mesh credit: metal_foxlionsteampunkbeast.
  • ...and 10 more figures