TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming

Zeyuan Yin, Xiaoming Liu

TL;DR

TRIM tackles slow 3D Gaussian diffusion inference with a post-training framework that combines temporal trajectory reduction and spatial instance masking. A lightweight latent selector is trained offline to pick higher-quality intermediate latents midway through denoising, while a training-free corner-reference attention mask prunes background tokens; a post-denoising correction step then removes artifacts. Empirical results on text-to-3D and image-to-3D generation show faster inference and higher semantic and aesthetic quality, with improved CLIP-based metrics and reconstruction fidelity. The method is model-agnostic and avoids retraining, enabling inference-time scaling across backbones, though it relies on 2D priors and may modestly reduce diversity. Future work could pursue 3D-structure-aware diffusion and end-to-end spatial trimming for broader efficiency gains.

Abstract

Recent advances in 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose TRIM (Trajectory Reduction and Instance Mask denoising), a post-training approach that incorporates both temporal and spatial trimming strategies, to accelerate inference without compromising output quality while supporting the inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model to evaluate latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by selecting candidates with high-quality potential. Furthermore, we introduce instance mask denoising to prune learnable Gaussian primitives by filtering out redundant background regions, reducing inference computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at https://github.com/zeyuanyin/TRIM.
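
The early trajectory reduction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `denoise_step` and `selector_score` are hypothetical stand-ins for a real diffusion denoiser and the trained selector, and the parameters `num_candidates`, `total_steps`, and `select_step` are assumptions chosen for illustration. The sketch only shows the control flow: denoise several sampled noises partway, keep the candidate the selector ranks highest, and finish denoising on that single trajectory.

```python
import random


def denoise_step(latent, step):
    # Hypothetical stand-in for one denoising step of a diffusion model.
    return [x * 0.9 for x in latent]


def selector_score(latent):
    # Hypothetical stand-in for the lightweight selector (higher is better).
    return -sum(x * x for x in latent)


def trim_trajectories(num_candidates, total_steps, select_step, dim=4, seed=0):
    rng = random.Random(seed)
    # Sample one initial noise per candidate trajectory.
    candidates = [
        [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_candidates)
    ]
    # Denoise every candidate only up to the selection step.
    for step in range(select_step):
        candidates = [denoise_step(c, step) for c in candidates]
    # Keep the single candidate the selector scores highest.
    best = max(candidates, key=selector_score)
    # Finish the remaining steps on the surviving trajectory only.
    for step in range(select_step, total_steps):
        best = denoise_step(best, step)
    # Count denoiser calls to show the saving over full trajectory scaling.
    steps_run = num_candidates * select_step + (total_steps - select_step)
    return best, steps_run
```

With 4 candidates, 20 steps, and selection at step 10, this sketch makes 50 denoiser calls instead of the 80 needed to finish every trajectory before choosing one.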

Paper Structure

This paper contains 15 sections, 2 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: Comparison between the baseline DiffSplat [lin2025diffsplat] (Rows 1 and 3) and our TRIM (Rows 2 and 4). For text-to-3D, TRIM yields houses more aligned with the “candy” characteristics. For image-to-3D, TRIM generates more realistic details on eyes, tails, and horns. TRIM also reduces inference time from 8 to 5 seconds.
  • Figure 2: Overview of our TRIM framework, which consists of three stages. Given a text prompt such as "A rocking horse with scroll-work", the first stage reduces multiple denoising trajectory candidates to one trajectory with high-quality potential. The second stage applies an instance mask to simplify background regions during the denoising process. The last stage uses the mask to correct the Gaussian primitive parameters.
  • Figure 3: Details of the temporal (left) and spatial (right) trimming schemes. In temporal trimming, high-quality trajectories are selected early using a lightweight selector. In spatial trimming, the mask is detected and used to separate and merge background tokens, reducing the number of tokens processed during denoising.
  • Figure 4: Qualitative results and comparisons on Text-to-3D (left) and Image-to-3D (right) generation.
  • Figure 5: Qualitative comparisons on inference step scaling. Best viewed with zoom.
  • ...and 5 more figures
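
The spatial trimming idea in the Figure 3 caption (detect a mask, then separate and merge background tokens) can be sketched as follows. This is an illustrative assumption, not the paper's procedure: the source describes a corner-reference attention mask, whereas this sketch substitutes a plain Euclidean distance between each token's features and the mean of the four corner tokens. The function names, the `threshold` value, and the choice to merge all background tokens into one averaged token are all hypothetical.

```python
def corner_reference_mask(tokens, height, width, threshold=0.5):
    # tokens: feature vectors in row-major order for a height x width grid.
    # Assumption: the four grid corners show background, so their mean
    # feature serves as the background reference.
    corners = [0, width - 1, (height - 1) * width, height * width - 1]
    dim = len(tokens[0])
    ref = [sum(tokens[i][d] for i in corners) / len(corners) for d in range(dim)]

    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    # True marks a token as background (close to the corner reference).
    return [distance(t, ref) < threshold for t in tokens]


def merge_background(tokens, is_background):
    # Keep foreground tokens and replace all background tokens by their mean.
    foreground = [t for t, bg in zip(tokens, is_background) if not bg]
    background = [t for t, bg in zip(tokens, is_background) if bg]
    if not background:
        return foreground
    dim = len(tokens[0])
    merged = [sum(t[d] for t in background) / len(background) for d in range(dim)]
    return foreground + [merged]
```

On a 3 x 3 grid where only the center token differs from the corners, the nine tokens shrink to two: the foreground token and one merged background token. A denoiser would then process the shorter sequence at each step.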