GaussianLens: Localized High-Resolution Reconstruction via On-Demand Gaussian Densification

Yijia Weng, Zhicheng Wang, Songyou Peng, Saining Xie, Howard Zhou, Leonidas J. Guibas

TL;DR

This work tackles the challenge of reconstructing high-frequency details only where needed by formulating localized high-resolution reconstruction via on-demand Gaussian densification. The authors introduce GaussianLens, a cross-modal framework that densifies an initial low-resolution 3D Gaussian Splatting (3DGS) reconstruction within a user-specified RoI by fusing multi-view images with Gaussian features through a PointTransformer-based encoder and projection-based cross-attention, producing densified Gaussians as residuals. To handle substantial resolution increases, they add a pixel-guided densification pathway that spawns a Gaussian per RoI pixel, enabling faithful preservation of fine details. They validate on RealEstate10K and DL3DV, showing improved RoI detail, strong generalization to unseen Gaussian sources, and favorable efficiency compared with full high-resolution baselines, supported by ablations and a dedicated RoI view-synthesis benchmark.
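The pixel-guided pathway described above, one spawned Gaussian per RoI pixel, can be pictured as a back-projection step. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes a per-pixel depth map is available for the high-resolution view (for example rendered from the initial 3DGS), a pinhole camera with intrinsics `fx, fy, cx, cy`, and a camera-to-world matrix. The function name `spawn_centers_from_roi` and all parameter names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def spawn_centers_from_roi(
    depth: Sequence[Sequence[float]],
    roi_mask: Sequence[Sequence[bool]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Return one candidate Gaussian center per valid RoI pixel.

    Assumption (not from the source): depth comes from some external
    estimate, and cam_to_world is a 3x4 or 4x4 row-major matrix.
    """
    centers: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, roi_mask)):
        for u, (z, inside) in enumerate(zip(depth_row, mask_row)):
            if not inside or z <= 0.0:
                continue  # skip pixels outside the RoI or without valid depth
            # Back-project the pixel center into the camera frame.
            x_cam = (u + 0.5 - cx) / fx * z
            y_cam = (v + 0.5 - cy) / fy * z
            p_cam = (x_cam, y_cam, z, 1.0)
            # Transform the homogeneous point into world coordinates.
            world = tuple(
                sum(cam_to_world[r][c] * p_cam[c] for c in range(4))
                for r in range(3)
            )
            centers.append(world)
    return centers
```

In this reading, each returned center would seed a new Gaussian whose remaining parameters (scale, rotation, opacity, color) are left to the network to predict; how the paper initializes those is not specified in this summary.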

Abstract

We perceive our surroundings with an active focus, paying more attention to regions of interest, such as the shelf labels in a grocery store. When it comes to scene reconstruction, this human perception trait calls for spatially varying degrees of detail ready for closer inspection in critical regions, preferably reconstructed on demand. While recent works in 3D Gaussian Splatting (3DGS) achieve fast, generalizable reconstruction from sparse views, their uniform resolution output leads to high computational costs unscalable to high-resolution training. As a result, they cannot leverage available images at their original high resolution to reconstruct details. Per-scene optimization methods reconstruct finer details with adaptive density control, yet require dense observations and lengthy offline optimization. To bridge the gap between the prohibitive cost of high-resolution holistic reconstructions and the user needs for localized fine details, we propose the problem of localized high-resolution reconstruction via on-demand Gaussian densification. Given a low-resolution 3DGS reconstruction, the goal is to learn a generalizable network that densifies the initial 3DGS to capture fine details in a user-specified local region of interest (RoI), based on sparse high-resolution observations of the RoI. This formulation avoids the high cost and redundancy of uniformly high-resolution reconstructions and fully leverages high-resolution captures in critical regions. We propose GaussianLens, a feed-forward densification framework that fuses multi-modal information from the initial 3DGS and multi-view images. We further design a pixel-guided densification mechanism that effectively captures details under large resolution increases. Experiments demonstrate our method's superior performance in local fine detail reconstruction and strong scalability to images of up to $1024\times1024$ resolution.

Paper Structure

This paper contains 32 sections, 13 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: We introduce the problem of localized high-resolution reconstruction via on-demand Gaussian densification. While the majority of feed-forward models are confined to single-pass, uniform-resolution reconstruction, GaussianLens achieves low-cost, high-resolution local reconstruction by learning to densify low-resolution initial 3DGS reconstructions conditioned on high-resolution local observations.
  • Figure 2: Method overview. (a)-(d) illustrate GaussianLens, our feed-forward densification framework. It constructs multi-modal features for initial Gaussians and images (a), further extracts features via a PTv3-based encoder (b) with projection-based cross attention to images (d), and decodes them into residual parameters of densified Gaussians (c). (e) illustrates our pixel-guided densification. (f) shows the overall workflow.
  • Figure 3: Novel view synthesis on RealEstate10K (Zhou et al., 2018) and DL3DV (Ling et al., 2024). Our method reconstructs finer details by effectively leveraging high-resolution observations and initial Gaussians.
  • Figure 4: Our densification results given input Gaussians from DepthSplat prediction or per-scene optimization. Our model achieves zero-shot generalization to per-scene optimized 3D Gaussians, improving upon initial reconstructions from both sources. The last two rows illustrate cases where per-scene optimization provides a more robust initialization, while DepthSplat struggles with sparse-view ambiguity, resulting in floaters in the third row and a wrongly angled door structure in the fourth row. Our model can leverage improved initial Gaussians and produce more accurate final reconstructions.
  • Figure 5: Breakdown visualization of source and output Gaussians. Starting from input Gaussians (b), our network takes input Gaussians in the specified RoI (c), and Gaussians from pixel-guided densification (d), and outputs updated versions of both (e, f). While the emergence of more details can already be observed in updated input Gaussians (e), Gaussians from pixel-guided densification (f) reconstruct sharp details more effectively.
  • ...and 8 more figures
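
The projection-based cross attention named in the Figure 2 caption can be illustrated with a small sketch. This is one plausible reading under loud assumptions, not the paper's architecture: each Gaussian center is projected into every view with a pinhole camera, the nearest-pixel image feature is gathered as a key/value, and a softmax over views fuses them into the Gaussian's feature. Learned query/key/value projections are omitted, and the names `View`, `project`, and `cross_attend` are hypothetical.

```python
import math
from typing import List, NamedTuple, Optional, Sequence, Tuple

Vec = List[float]


class View(NamedTuple):
    features: Sequence[Sequence[Vec]]        # H x W x C image feature map
    world_to_cam: Sequence[Sequence[float]]  # 3x4 or 4x4 row-major matrix
    intrinsics: Tuple[float, float, float, float]  # fx, fy, cx, cy


def project(point: Sequence[float], view: View) -> Optional[Tuple[int, int]]:
    """Project a world-space point to pixel indices (u, v), or None if behind the camera."""
    p = (point[0], point[1], point[2], 1.0)
    x, y, z = (
        sum(view.world_to_cam[r][c] * p[c] for c in range(4)) for r in range(3)
    )
    if z <= 0.0:
        return None
    fx, fy, cx, cy = view.intrinsics
    return math.floor(fx * x / z + cx), math.floor(fy * y / z + cy)


def cross_attend(query: Vec, center: Sequence[float], views: Sequence[View]) -> Vec:
    """Fuse image features sampled at the projections of one Gaussian center."""
    keys: List[Vec] = []
    for view in views:
        pix = project(center, view)
        if pix is None:
            continue
        u, v = pix
        if 0 <= v < len(view.features) and 0 <= u < len(view.features[0]):
            keys.append(view.features[v][u])
    if not keys:
        return list(query)  # Gaussian is not visible in any view
    # Scaled dot-product attention over views; keys double as values here.
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(logits)
    weights = [math.exp(logit - top) for logit in logits]
    total = sum(weights)
    fused = [
        sum(w * key[i] for w, key in zip(weights, keys)) / total
        for i in range(len(query))
    ]
    # Residual connection keeps the original Gaussian feature.
    return [q + f for q, f in zip(query, fused)]
```

Restricting attention to the pixels a Gaussian projects onto keeps the cost proportional to the number of views rather than the number of image tokens, which is presumably why a projection-based variant suits high-resolution inputs.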