TRIM: Scalable 3D Gaussian Diffusion Inference with Temporal and Spatial Trimming
Zeyuan Yin, Xiaoming Liu
TL;DR
TRIM speeds up slow 3D Gaussian diffusion inference with a post-training framework that combines temporal trajectory reduction and spatial instance masking. A lightweight latent selector, trained offline, picks the higher-quality intermediate latents midway through denoising, while a training-free corner-reference attention mask prunes background tokens; a post-denoising correction step then removes artifacts. Experiments on text-to-3D and image-to-3D generation show faster inference and higher semantic and aesthetic quality, with improved CLIP-based metrics and reconstruction fidelity. The method is model-agnostic and requires no retraining of the backbone, which enables inference-time scaling across backbones, though it relies on 2D priors and may modestly reduce diversity. Future work could pursue 3D-structure-aware diffusion and end-to-end spatial trimming for broader efficiency gains.
Abstract
Recent 3D Gaussian diffusion models suffer from time-intensive denoising and post-denoising processing due to the massive number of Gaussian primitives, resulting in slow generation and limited scalability along sampling trajectories. To improve the efficiency of 3D diffusion models, we propose $\textbf{TRIM}$ ($\textbf{T}$rajectory $\textbf{R}$eduction and $\textbf{I}$nstance $\textbf{M}$ask denoising), a post-training approach that combines temporal and spatial trimming strategies to accelerate inference without compromising output quality, while supporting inference-time scaling for Gaussian diffusion models. Instead of scaling denoising trajectories in a costly end-to-end manner, we develop a lightweight selector model that evaluates latent Gaussian primitives derived from multiple sampled noises, enabling early trajectory reduction by keeping only the candidates with high-quality potential. Furthermore, we introduce instance mask denoising, which prunes learnable Gaussian primitives by filtering out redundant background regions and thereby reduces the computation at each denoising step. Extensive experiments and analysis demonstrate that TRIM significantly improves both the efficiency and quality of 3D generation. Source code is available at $\href{https://github.com/zeyuanyin/TRIM}{link}$.
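The two trimming ideas can be illustrated with a minimal sketch. It is not the released implementation: `denoise_step`, `selector_score`, `trajectory_reduction`, and `prune_tokens` are hypothetical names, the denoiser and selector are toy stand-ins, and the foreground mask is assumed to be given rather than derived from attention. The sketch only shows the control flow: several sampled noises are denoised up to an intermediate step, a selector keeps one candidate, and only that candidate is denoised to the end; separately, background tokens are dropped with a boolean mask.

```python
import numpy as np

def denoise_step(latent, t):
    # Hypothetical stand-in for one denoiser call: shrink the latent toward zero.
    return latent * 0.9

def selector_score(latent):
    # Hypothetical stand-in for the learned selector: higher means better.
    return -float(np.abs(latent).mean())

def trajectory_reduction(noises, total_steps, select_at):
    # Temporal trimming: run every candidate only up to the selection step.
    candidates = []
    for z in noises:
        for t in range(select_at):
            z = denoise_step(z, t)
        candidates.append(z)
    # Keep the single candidate that the selector ranks highest.
    scores = [selector_score(z) for z in candidates]
    best = int(np.argmax(scores))
    z = candidates[best]
    # Finish the remaining steps for the surviving trajectory only.
    for t in range(select_at, total_steps):
        z = denoise_step(z, t)
    return best, z

def prune_tokens(tokens, foreground_mask):
    # Spatial trimming: keep only the rows (tokens) flagged as foreground.
    return tokens[foreground_mask]
```

Under these assumptions, with $N$ candidates, $T$ total steps, and selection at step $s$, the sketch makes $N s + (T - s)$ denoiser calls instead of the $N T$ calls needed to run every trajectory to the end.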
