ITS3D: Inference-Time Scaling for Text-Guided 3D Diffusion Models

Zhenglin Zhou, Fan Ma, Xiaobo Xia, Hehe Fan, Yi Yang, Tat-Seng Chua

TL;DR

ITS3D introduces an inference-time scaling framework for text-guided 3D diffusion models that optimizes the initial Gaussian noise input through a verifier-guided search. It stabilizes the search with Gaussian normalization, improves efficiency by compressing the high-dimensional search space via SVD, and sustains exploration with a singular space reset. On GPTEval3D, ITS3D improves human-preference, image-text alignment, and comprehensive 3D quality metrics without additional training. The approach demonstrates the practical value of structured, search-based inference-time optimization for 3D generation and points toward semantic-aware verifiers as a promising future direction.

Abstract

We explore inference-time scaling in text-guided 3D diffusion models to enhance generative quality without additional training. To this end, we introduce ITS3D, a framework that formulates the task as an optimization problem: identifying the most effective Gaussian noise input. The framework is driven by a verifier-guided search algorithm that iteratively refines noise candidates based on verifier feedback. To address the inherent challenges of 3D generation, we introduce three techniques for improved stability, efficiency, and exploration capability. 1) Gaussian normalization is applied to stabilize the search process. It corrects distribution shifts when noise candidates deviate from a standard Gaussian distribution during iterative updates. 2) The high-dimensional nature of the 3D search space increases computational complexity. To mitigate this, a singular value decomposition-based compression technique is employed to reduce dimensionality while preserving effective search directions. 3) To further prevent convergence to suboptimal local minima, a singular space reset mechanism dynamically updates the search space based on diversity measures. Extensive experiments demonstrate that ITS3D enhances text-to-3D generation quality, showing the potential of computationally efficient search methods in generative processes. The source code is available at https://github.com/ZhenglinZhou/ITS3D.
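
As a rough illustration of the first technique, the sketch below shows how repeated perturbations push a noise vector away from a standard Gaussian and how a normalization step pulls it back, followed by the simplest verifier-guided search (best-of-N random search). It is a minimal sketch under stated assumptions, not the released implementation: the per-tensor standardization in `gaussian_normalize`, the 1-D noise vector, and `toy_verifier` (a stand-in for "run the 3D diffusion model, then score the result") are all hypothetical.

```python
import numpy as np


def gaussian_normalize(noise, eps=1e-8):
    """Rescale a noise array to zero mean and unit variance.

    Assumption: per-tensor standardization; the paper may normalize differently.
    """
    return (noise - noise.mean()) / (noise.std() + eps)


def toy_verifier(noise):
    """Hypothetical scorer standing in for 'generate a 3D asset, then score it'."""
    half = noise.size // 2
    return float(noise[:half].mean() - noise[half:].mean())


def random_search(dim, n_candidates, seed=0):
    """Best-of-N random search: sample candidates and keep the top-scoring one."""
    rng = np.random.default_rng(seed)
    best_noise, best_score = None, -np.inf
    for _ in range(n_candidates):
        cand = gaussian_normalize(rng.standard_normal(dim))
        score = toy_verifier(cand)
        if score > best_score:
            best_noise, best_score = cand, score
    return best_noise, best_score


# Drift demo: ten additive perturbations inflate the variance of the noise.
rng = np.random.default_rng(0)
drifted = rng.standard_normal(4096)
for _ in range(10):
    drifted = drifted + 0.3 * rng.standard_normal(4096)
restored = gaussian_normalize(drifted)

# Scaling demo: a larger candidate budget can only match or improve the score
# here, because both runs share a seed and therefore the first four candidates.
_, score_few = random_search(4096, n_candidates=4)
_, score_many = random_search(4096, n_candidates=64)
```

In this toy setting the inference-time budget is the number of candidates scored; in the actual framework each score would require a full generation pass plus a verifier call.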

Paper Structure

This paper contains 28 sections, 7 equations, 11 figures, 4 tables, and 6 algorithms.

Figures (11)

  • Figure 1: Inspired by the success of inference-time scaling in LLMs [snell2024scaling] and 2D diffusion [ma2025inference, guo2025can], we explore its application to 3D diffusion models (e.g., GaussianCube [zhang2024gaussiancube] and TRELLIS [xiang2024structured]). We evaluate random search, zero-order search, and heuristic search [yang2009firefly] on GPTEval3D [wu2024gpt], demonstrating consistent improvements in human preference scores [wu2023human]. Our search framework achieves quality enhancement across all settings without requiring additional training.
  • Figure 2: Overview of our inference-time scaling framework for text-guided 3D diffusion models. (1) Compressed search space: Gaussian noise undergoes SVD to initialize a lower-dimensional singular space, improving search efficiency. A singular space reset mechanism updates the search space when candidate diversity decreases, preventing convergence to local minima. (2) Gaussian normalization: The optimized noise is then reconstructed and passed through Gaussian normalization to maintain a standard normal distribution, stabilizing the search process. (3) Iterative search process: The refined noise is fed into a 3D diffusion model, evaluated by a search verifier, and iteratively refined by a search algorithm, ensuring improvement through inference-time scaling.
  • Figure 3: Illustrations of search algorithms. (a) Random search explores the search space by randomly sampling candidate points without considering prior evaluations. (b) Zero-order search iteratively refines the search by selecting the best candidate from a set of perturbed samples within a defined radius (dashed circle), updating the search pivot accordingly. (c) Heuristic search guides candidates toward higher-scoring regions using heuristic movement strategies, such as attraction mechanisms [yang2009firefly].
  • Figure 4: Ablation study on Gaussian normalization and compressed search space. We evaluate the effect of Gaussian normalization and compressed search space on random search (left), zero-order search (middle), and heuristic search (right). Across all search strategies, Gaussian normalization (GN) improves stability and search effectiveness, while the compressed search space (CSS) with singular space reset (SSR) further enhances efficiency and generation quality.
  • Figure 5: Qualitative comparisons on the GPTEval3D benchmark [wu2024gpt]. We compare the baselines [zhang2024gaussiancube, xiang2024structured] (left) and the baselines with inference-time scaling (right) across multiple prompts. The baseline results often exhibit artifacts, a lack of detail, or inconsistencies in shape and texture. In comparison, applying inference-time scaling can lead to higher fidelity, improved textures, and better structural coherence.
  • ...and 6 more figures
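
The pipeline in Figure 2 and the zero-order search in Figure 3(b) can be sketched together as follows. This is an illustrative sketch under several assumptions, not the authors' code: the noise is treated as a 2-D matrix, the search coordinates are taken to be the top-`k` singular values (with the singular vectors held fixed), the discarded components are kept as a fixed residual, and `toy_verifier` is a hypothetical stand-in for generating a 3D asset and scoring it. The singular space reset is omitted.

```python
import numpy as np


def gaussian_normalize(x, eps=1e-8):
    """Assumed per-tensor standardization back to zero mean, unit variance."""
    return (x - x.mean()) / (x.std() + eps)


def build_singular_space(noise_matrix, k):
    """SVD the noise and keep the top-k directions as the compressed space."""
    U, s, Vt = np.linalg.svd(noise_matrix, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    residual = noise_matrix - (U_k * s_k) @ Vt_k  # part left out of the search
    return U_k, s_k, Vt_k, residual


def reconstruct(U_k, coords, Vt_k, residual):
    """Map k search coordinates back to full-size, normalized noise."""
    return gaussian_normalize((U_k * coords) @ Vt_k + residual)


def toy_verifier(noise):
    """Hypothetical scorer: closeness to a fixed random target matrix."""
    target = np.random.default_rng(123).standard_normal(noise.shape)
    return -float(np.mean((noise - target) ** 2))


def zero_order_search(noise_matrix, k=8, rounds=15, n_candidates=6,
                      radius=0.5, seed=0):
    """Perturb the pivot within a radius, score candidates, move to the best."""
    rng = np.random.default_rng(seed)
    U_k, pivot, Vt_k, residual = build_singular_space(noise_matrix, k)
    best = toy_verifier(reconstruct(U_k, pivot, Vt_k, residual))
    history = [best]
    for _ in range(rounds):
        cands = pivot + radius * rng.standard_normal((n_candidates, k))
        scores = [toy_verifier(reconstruct(U_k, c, Vt_k, residual))
                  for c in cands]
        i = int(np.argmax(scores))
        if scores[i] > best:  # update the pivot only on improvement
            pivot, best = cands[i], scores[i]
        history.append(best)
    return reconstruct(U_k, pivot, Vt_k, residual), pivot, history


noise0 = np.random.default_rng(1).standard_normal((32, 32))
best_noise, coords, history = zero_order_search(noise0)
```

Here each round searches over 8 coordinates rather than the 1,024 entries of the noise matrix, which is the sense in which the compressed space reduces the dimensionality of the search; the choice of `k`, `radius`, and the matrix shape are arbitrary values for illustration.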