Twinner: Shining Light on Digital Twins in a Few Snaps

Jesus Zarzar, Tom Monnier, Roman Shapovalov, Andrea Vedaldi, David Novotny

TL;DR

Twinner addresses digital twinning by enabling relighting and realistic rendering of objects from a few views. It introduces a memory-efficient tricolumn-based large reconstruction model that predicts geometry, PBR textures, and scene illumination, trained with synthetic data and real-world shading supervision. The key contributions include the tricolumn representation, procedurally generated PBR data, and a cubemap illumination predictor that allows learning from real data without ground-truth lighting. Experiments on StanfordORB show Twinner outperforms feed-forward baselines and rivals slow optimization methods in quality while being orders of magnitude faster, enabling practical digital twins.

Abstract

We present the first large reconstruction model, Twinner, capable of recovering a scene's illumination as well as an object's geometry and material properties from only a few posed images. Twinner is based on the Large Reconstruction Model and innovates in three key ways: 1) We introduce a memory-efficient voxel-grid transformer whose memory scales only quadratically, rather than cubically, with the resolution of the voxel grid. 2) To deal with the scarcity of high-quality ground-truth PBR-shaded models, we introduce a large fully-synthetic dataset of procedurally-generated PBR-textured objects lit with varied illumination. 3) To narrow the synthetic-to-real gap, we finetune the model on real-life datasets by means of a differentiable physically-based shading model, eschewing the need for ground-truth illumination or material properties, which are challenging to obtain in real life. We demonstrate the efficacy of our model on the real-life StanfordORB benchmark where, given only a few input views, we achieve reconstruction quality significantly superior to existing feed-forward reconstruction networks, and comparable to significantly slower per-scene optimization methods.
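
To make the memory claim concrete, the following is a minimal sketch that compares token counts at a given grid resolution. It assumes, following the tricolumn figure caption, that the transformer operates on $3R^2$ column tokens, and it compares this against a hypothetical dense baseline with one token per voxel ($R^3$). The function names and the resolutions are illustrative and do not come from the paper.

```python
def voxel_token_count(resolution: int) -> int:
    """Tokens for a hypothetical dense baseline: one token per voxel (R^3)."""
    return resolution ** 3


def tricolumn_token_count(resolution: int) -> int:
    """Tokens for the tricolumn layout: three R x R grids of columns (3 R^2)."""
    return 3 * resolution ** 2


if __name__ == "__main__":
    # Illustrative resolutions only; the paper's actual R is not stated here.
    for r in (16, 32, 64):
        dense = voxel_token_count(r)
        tri = tricolumn_token_count(r)
        print(f"R={r}: dense={dense}, tricolumn={tri}, ratio={dense / tri:.1f}x")
```

The ratio between the two counts is $R/3$, so the saving grows linearly with the resolution.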

Paper Structure

This paper contains 35 sections, 11 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: We present Twinner, the first large reconstruction model capable of predicting a scene's illumination as well as an object's geometry and material properties from a few posed images. This enables tasks such as realistic relighting of objects in novel scenes in a few seconds by replacing costly per-scene optimizations with a single forward pass.
  • Figure 2: Overview of Twinner. The input views are processed together with their foreground masks and camera poses by an image tokenizer. The resulting tokens are processed by a Diffusion Transformer (DiT) model to predict a 3D volumetric representation of the scene, and by a second DiT model to predict the scene's illumination. The 3D representation is then rendered to obtain images of the scene's material properties, normals, and opacity from the target viewpoint. Using these rendered images together with the predicted illumination, Twinner renders an approximate physically-shaded image of the scene, which then enters a photometric loss.
  • Figure 3: Tricolumn architecture. An overview of how the novel tricolumn representation is used in Twinner. The transformer module takes image tokens and a learnable embedding and predicts the tricolumn representation: a grid of $3R^2$ feature vectors, each in $\mathbb{R}^{CR}$. These are split into three axis-aligned column grids of $R^2$ vectors each, again in $\mathbb{R}^{CR}$. Finally, each axis-aligned column grid is reshaped into a voxel grid of size $R \times R \times R$ with features in $\mathbb{R}^{C}$, and the three voxel grids are summed into a single voxel grid.
  • Figure 4: Examples from our procedural dataset visualizing the shaded image, environment map, materials, and normals from our environment-map-endowed version of Zeroverse (Xie et al., 2024).
  • Figure 5: StanfordORB Illumination Visualization. We visualize the illumination estimation of multiple baselines and our method on four different scenes from the StanfordORB dataset.
  • ...and 2 more figures
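
The reshape-and-sum step described in the Figure 3 caption can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `tricolumns_to_voxels`, the ordering of the three column grids, and the way each $CR$-dimensional vector is split into $R$ depth slots of $C$ channels are all assumptions made for the example.

```python
import numpy as np


def tricolumns_to_voxels(tokens: np.ndarray, R: int, C: int) -> np.ndarray:
    """Turn (3*R*R, C*R) tricolumn tokens into an (R, R, R, C) voxel grid.

    Assumed layout (hypothetical, for illustration only):
      block 0 holds columns running along x, indexed by (y, z);
      block 1 holds columns running along y, indexed by (x, z);
      block 2 holds columns running along z, indexed by (x, y).
    Each C*R vector is read as R depth slots of C channels.
    """
    assert tokens.shape == (3 * R * R, C * R)
    # (block, first index, second index, depth along the column, channel)
    cols = tokens.reshape(3, R, R, R, C)
    along_x = cols[0].transpose(2, 0, 1, 3)  # (y, z, x, C) -> (x, y, z, C)
    along_y = cols[1].transpose(0, 2, 1, 3)  # (x, z, y, C) -> (x, y, z, C)
    along_z = cols[2]                        # already (x, y, z, C)
    # Sum the three axis-aligned voxel grids into a single grid.
    return along_x + along_y + along_z


if __name__ == "__main__":
    R, C = 4, 2
    tokens = np.random.default_rng(0).normal(size=(3 * R * R, C * R))
    voxels = tricolumns_to_voxels(tokens, R, C)
    print(voxels.shape)  # (4, 4, 4, 2)
```

Under these assumptions, every voxel feature is the sum of three column features, one from each axis-aligned column passing through that voxel.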