SceneTok: A Compressed, Diffusable Token Space for 3D Scenes

Mohammad Asim, Christopher Wewer, Jan Eric Lenssen

TL;DR

SceneTok introduces a two-stage approach to 3D scene modeling by compressing view sets into an unstructured token space $\mathcal{Z} = \{\mathbf{z}_i\}_{i=1}^K$ (with $K$ around 1024) that can be rendered from novel trajectories using a lightweight rectified-flow decoder, and further enables fast latent-space generation via a diffusion transformer (SceneGen). The autoencoder (SceneTok) decouples rendering from generation, allowing a diffusion-based renderer to operate on a compact token set and enabling 32 views/second rendering and 8–11 seconds for full latent generation of scenes on a single GPU. The work demonstrates state-of-the-art reconstruction quality with dramatically smaller representations, robust transferability to novel trajectories, and efficient scene generation compared to 3D-space or view-space generation paradigms. The approach provides a scalable, data-efficient path for 3D scene synthesis and could benefit multi-modal, large-scale generative systems by exposing a compact, diffusable latent space for 3D content.
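
The "unstructured" token set $\mathcal{Z}$ has no grid order, so any consumer that reads it through attention sees the same result regardless of how the tokens are listed. The following is a minimal sketch of that property, not the authors' decoder: the function `attend`, the toy sizes, and the absence of learned projections are all assumptions made for illustration.

```python
import math
import random

def attend(query, tokens):
    """Softmax attention of one query vector over a set of token vectors.

    Hypothetical stand-in for a decoder readout; no learned projections.
    """
    d = len(query)
    scores = [sum(q * z for q, z in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * tok[j] for w, tok in zip(weights, tokens)) / total for j in range(d)]

rng = random.Random(0)
K, d = 16, 4  # toy sizes; the summary above reports K around 1024
tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
query = [rng.gauss(0, 1) for _ in range(d)]

# Reorder the token set; the readout should not change.
shuffled = tokens[:]
rng.shuffle(shuffled)

out_a = attend(query, tokens)
out_b = attend(query, shuffled)
max_diff = max(abs(a - b) for a, b in zip(out_a, out_b))
```

The two readouts agree up to floating-point summation order, which is why the token set can be treated as a set rather than a sequence or a spatial grid.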

Abstract

We present SceneTok, a novel tokenizer for encoding view sets of scenes into a compressed and diffusable set of unstructured tokens. Existing approaches for 3D scene representation and generation commonly use 3D data structures or view-aligned fields. In contrast, we introduce the first method that encodes scene information into a small set of permutation-invariant tokens that is disentangled from the spatial grid. The scene tokens are predicted by a multi-view tokenizer given many context views and rendered into novel views by employing a light-weight rectified flow decoder. We show that the compression is 1-3 orders of magnitude stronger than for other representations while still reaching state-of-the-art reconstruction quality. Further, our representation can be rendered from novel trajectories, including ones deviating from the input trajectory, and we show that the decoder gracefully handles uncertainty. Finally, the highly-compressed set of unstructured latent scene tokens enables simple and efficient scene generation in 5 seconds, achieving a much better quality-speed trade-off than previous paradigms.
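
As a rough illustration of how a rectified-flow decoder turns noise into an output, the sketch below integrates a velocity field with Euler steps along the straight path $x_t = (1 - t)\,x_0 + t\,x_1$. It is a toy under stated assumptions: `velocity` is a closed-form stand-in for a learned network (which in SceneTok would presumably be conditioned on the scene tokens and target camera), the convention that $t = 0$ is noise is assumed, and the names `velocity` and `sample` are hypothetical.

```python
import random

def velocity(x, t, target):
    """Toy velocity field for a single known target.

    On the straight path x_t = (1 - t) * noise + t * target, the constant
    velocity (target - noise) equals (target - x_t) / (1 - t).
    A real decoder would predict this with a neural network instead.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample(noise, target, steps=8):
    """Euler integration of dx/dt = v(x, t) from t = 0 (noise) to t = 1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
target = [0.5, -1.0, 2.0]                   # stands in for a clean latent
noise = [rng.gauss(0, 1) for _ in target]   # starting sample at t = 0

result = sample(noise, target, steps=8)
one_step = sample(noise, target, steps=1)
```

Because the path is a straight line, the Euler steps are exact in this toy case and even a single step lands on the target; a learned velocity field is only approximately straight, so the number of steps becomes a quality-speed knob.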

Paper Structure (55 sections, 6 equations, 30 figures, 10 tables)

Figures (30)

  • Figure 1: Setup Overview. We introduce SceneTok, a tokenizer that encodes view sets into an unstructured, highly-compressed set of tokens, which can be efficiently rendered (32 images per second) from novel trajectories with a light-weight generative decoder. The token space is diffusable and allows latent generation of scenes in 8 seconds.
  • Figure 2: Method Overview. (a) The SceneTok autoencoder encodes view sets into a set of compressed, unstructured scene tokens by chaining a VA-VAE image compressor and a perceiver module. The tokens can be rendered from novel views with a generative decoder based on rectified flows. (b) A latent diffusion transformer can perform scene generation by generating compressed scene tokens. Scene generation can be conditioned on a single image or a few images and a set of anchor poses, which define the spatial scene extent.
  • Figure 3: A single scene perceiver block. The scene perceiver module consists of $L$ blocks, alternating self-attention between scene queries $\mathbf{Q}$ and cross-attention to the multi-view encoder branch, which is AdaLN-modulated with ray embeddings.
  • Figure 4: Qualitative NVS. Each example shows the context views (only four are shown), the target rendering of SceneTok, and the ground-truth target view (only one is shown). While the baselines suffer from blur artifacts even without compression, SceneTok produces cleaner renderings and better details from a small set of scene tokens. More examples in Appendix Sec. \ref{sec:qualitative_results_add}.
  • Figure 5: Qualitative single-view generation comparison. Each example shows the input view, the rendered target views from the generated scene, and the ground-truth views. DFM produces blurry results and struggles to generate complete structures. Although our method falls short in generating high-quality, high-frequency details when compared to DFoT, which performs pixel-space diffusion, and the foundational generative model SEVA, it is significantly more efficient (cf. Tab. \ref{tab:single_view_generation}). More examples in Appendix Sec. \ref{sec:qualitative_results_add}.
  • ...and 25 more figures
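
The scene perceiver block described in the Figure 3 caption can be sketched roughly as follows. This is a simplified reading of the caption rather than the paper's architecture: it assumes that the multi-view features are normalized and then scaled and shifted by values taken from the ray embeddings (AdaLN), it omits all learned projections, MLPs, and multi-head structure, and every function name here (`layer_norm`, `adaln`, `attention`, `perceiver_block`) is hypothetical.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln(x, cond):
    """AdaLN: layer-normalize x, then scale and shift it using cond.

    Assumption: the first half of cond is the scale, the second half the
    shift. A real model would derive both with a learned projection.
    """
    d = len(x)
    scale, shift = cond[:d], cond[d:]
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]

def attention(queries, context):
    """Single-head softmax attention; context serves as keys and values."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wi * c[j] for wi, c in zip(w, context)) / total for j in range(d)])
    return out

def add(a, b):
    """Row-wise residual addition of two lists of vectors."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

def perceiver_block(scene_queries, view_feats, ray_embeds):
    # 1) Self-attention among the scene queries Q, with a residual connection.
    q = add(scene_queries, attention(scene_queries, scene_queries))
    # 2) Modulate the multi-view features with their ray embeddings (AdaLN).
    ctx = [adaln(f, r) for f, r in zip(view_feats, ray_embeds)]
    # 3) Cross-attention from the queries to the modulated view features.
    return add(q, attention(q, ctx))

rng = random.Random(0)
K, N, d, L = 8, 12, 4, 2  # toy sizes: queries, view tokens, width, blocks
queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(K)]
view_feats = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]
ray_embeds = [[rng.gauss(0, 0.1) for _ in range(2 * d)] for _ in range(N)]

tokens = queries
for _ in range(L):  # stack L blocks, as in the caption
    tokens = perceiver_block(tokens, view_feats, ray_embeds)
```

The output keeps the shape of the query set (K vectors of width d) no matter how many view tokens N are attended to, which is what lets a perceiver-style module compress many context views into a fixed number of scene tokens.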