CLiFT: Compressive Light-Field Tokens for Compute-Efficient and Adaptive Neural Rendering

Zhengqing Wang, Yuefan Wu, Jiacheng Chen, Fuyang Zhang, Yasutaka Furukawa

TL;DR

This paper proposes a neural rendering approach that represents a scene as compressed light-field tokens (CLiFTs), which retain rich appearance and geometric information about the scene. The approach achieves significant data reduction at comparable rendering quality and the highest overall rendering score, while offering trade-offs among data size, rendering quality, and rendering speed.

Abstract

This paper proposes a neural rendering approach that represents a scene as "compressed light-field tokens (CLiFTs)", which retain rich appearance and geometric information about the scene. CLiFT enables compute-efficient rendering from compressed tokens, and a single trained network can change the number of tokens used to represent a scene or to render a novel view. Concretely, given a set of images, a multi-view encoder tokenizes the images together with their camera poses. Latent-space K-means then uses the tokens to select a reduced set of rays as cluster centroids. The multi-view "condenser" compresses the information of all the tokens into the centroid tokens to construct CLiFTs. At test time, given a target view and a compute budget (i.e., the number of CLiFTs), the system collects the specified number of nearby tokens and synthesizes a novel view using a compute-adaptive renderer. Extensive experiments on the RealEstate10K and DL3DV datasets validate the approach quantitatively and qualitatively: it achieves significant data reduction at comparable rendering quality and the highest overall rendering score, while offering trade-offs among data size, rendering quality, and rendering speed.
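
The latent K-means step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the random initialisation, the fixed iteration count, and the rule of snapping each cluster mean to the nearest real token are all assumptions made for illustration, and tokens are plain Python lists of floats rather than encoder outputs.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, candidates):
    """Index into `candidates` of the vector closest to `point`."""
    return min(range(len(candidates)), key=lambda i: sq_dist(point, candidates[i]))


def latent_kmeans_select(tokens, k, iters=10, seed=0):
    """Cluster token features and pick one representative token per cluster.

    Hypothetical helper: returns (centroid_indices, assignments), where
    centroid_indices[c] is the index of the real token closest to the mean of
    cluster c, and assignments[i] is the cluster id of token i.
    """
    rng = random.Random(seed)
    dim = len(tokens[0])
    # Assumed initialisation: k distinct tokens drawn at random.
    means = [list(tokens[i]) for i in rng.sample(range(len(tokens)), k)]
    for _ in range(iters):
        # Assignment step: each token joins its nearest mean.
        assignments = [nearest(t, means) for t in tokens]
        # Update step: recompute each mean; an empty cluster keeps its old mean.
        for c in range(k):
            members = [t for t, a in zip(tokens, assignments) if a == c]
            if members:
                means[c] = [sum(m[d] for m in members) / len(members) for d in range(dim)]
    assignments = [nearest(t, means) for t in tokens]
    # Snap each mean to an actual token so every centroid corresponds to a real ray.
    centroid_indices = [nearest(means[c], tokens) for c in range(k)]
    return centroid_indices, assignments


# Toy example: six 2-D "tokens" forming two well-separated groups.
toy_tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
toy_centroids, toy_assignments = latent_kmeans_select(toy_tokens, k=2)
```

In the pipeline described above, the condenser would then compress the information of every token into these selected centroid tokens; that learned step is not sketched here.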

Paper Structure

This paper contains 15 sections, 1 equation, 13 figures, 2 tables, 1 algorithm.

Figures (13)

  • Figure 1: The training and the inference system overview. Top: The training consists of three steps: 1) Multi-view encoder, tokenizing the input images; 2) Latent K-means, selecting a representative set of tokens; and 3) Neural condensation, compressing the information of all the tokens into the representative set to produce Compressed Light-Field Tokens (CLiFTs). Bottom: At inference time, multi-view images are encoded into CLiFTs following the same process as in training. Given a target view, we collect a relevant set of CLiFTs with simple heuristics and render a novel view.
  • Figure 2: Main evaluation results on the RealEstate10K dataset (top) and the DL3DV dataset (bottom), comparing our approach with three baseline methods (LVSM-ED [jin2024lvsm], DepthSplat [xu2024depthsplat], and MVSplat [chen2024mvsplat]). The x-axis is the data size of the scene representation, and the y-axis is the rendering quality (PSNR, SSIM, and LPIPS). Our approach, CLiFT, can flexibly change the data size (i.e., the number of tokens) with one trained model, achieving significant data size reduction with comparable rendering quality and the highest overall PSNR, while providing trade-offs among data size, rendering quality, and rendering speed.
  • Figure 3: Ablation studies on our individual components, in particular, latent K-means and neural condensation. The plots compare three variants of our system by dropping latent K-means and neural condensation one by one from the system, while varying the data size. Specifically, the x-axis is the size of the scene representation. The y-axis is rendering quality (PSNR, LPIPS, and SSIM), rendering speed (FPS), or rendering cost (FLOPs), measured on an NVIDIA RTX A6000 GPU.
  • Figure 4: Qualitative rendering results of the baselines and ours at different data sizes (i.e., the number of CLiFT tokens for ours). Top: Ours vs. LVSM [jin2024lvsm] on RealEstate10K. Bottom: Ours vs. DepthSplat [xu2024depthsplat] on DL3DV. The PSNR value is reported under each rendering.
  • Figure 5: Visualization of the latent K-means clustering, where $K = N_s = 128$. Each color represents a cluster, and the yellow ring indicates the centroid token. Note that clustering is performed across multiple views, so a single cluster can span multiple images. As a result, some clusters may not have a visible centroid in a given image.
  • ...and 8 more figures
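
The test-time step, in which a target view and a compute budget determine how many nearby CLiFTs are collected, can be sketched as a budgeted nearest-neighbour lookup. The source only says the collection uses "simple heuristics", so everything specific here is an assumption: the function name `collect_clifts`, the idea that each CLiFT carries a 3-D position (for instance its source camera centre), and the use of Euclidean distance to the target camera centre as the notion of "nearby".

```python
import heapq
import math


def collect_clifts(clift_positions, target_position, budget):
    """Return indices of the `budget` tokens closest to the target camera.

    Hypothetical heuristic: closeness is the Euclidean distance between each
    token's assumed 3-D position and the target camera centre. If `budget`
    exceeds the number of tokens, all indices are returned, nearest first.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    return heapq.nsmallest(
        budget,
        range(len(clift_positions)),
        key=lambda i: math.dist(clift_positions[i], target_position),
    )


# Toy example: four tokens along the x-axis, target camera at x = 0.4.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
selected = collect_clifts(positions, (0.4, 0.0, 0.0), budget=2)
```

Raising or lowering `budget` is what would trade rendering quality against data size and speed: a larger value passes more tokens to the renderer, a smaller one fewer.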