ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation

Yunhong Min, Daehyeon Choi, Kyeongmin Yeo, Jihyun Lee, Minhyuk Sung

TL;DR

ORIGEN addresses zero-shot 3D orientation grounding in text-to-image generation for open-vocabulary, multi-object scenes by framing the problem as test-time reward-guided sampling with a one-step generative model. It defines a reward from GroundingDINO and OrientAnything and solves for latent vectors via Langevin dynamics, augmented with an adaptive time-rescaling mechanism to preserve realism while accelerating convergence. Empirical results on ORIBENCH-Single and ORIBENCH-Multi show ORIGEN achieving superior orientation grounding and strong text-to-image fidelity, supported by a user study endorsing its quality. The solution provides a practical, training-free pathway for precise 3D pose control in real-world, diverse image synthesis tasks, with broad applicability to layout-to-image and depth-guided generation as well.

Abstract

We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. While previous work on spatial grounding in image generation has mainly focused on 2D positioning, it lacks control over 3D orientation. To address this, we propose a reward-guided sampling approach using a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. While gradient-ascent-based optimization is a natural choice for reward-based guidance, it struggles to maintain image realism. Instead, we adopt a sampling-based approach using Langevin dynamics, which extends gradient ascent by simply injecting random noise, requiring just a single additional line of code. Additionally, we introduce adaptive time rescaling based on the reward function to accelerate convergence. Our experiments show that ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.
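The "single additional line" remark in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the quadratic reward, the `target` point, the temperature `alpha`, the step size, and all function names are hypothetical, and the target density is assumed to take the common form $q(\mathbf{x})\exp(\mathcal{R}(\mathbf{x})/\alpha)$ with $q = \mathcal{N}(\mathbf{0}, \mathbf{I})$. The sketch also omits the reward-adaptive time rescaling and the actual generative and reward models.

```python
import numpy as np

def reward_grad(x, target):
    # Gradient of a hypothetical toy reward R(x) = -0.5 * ||x - target||^2.
    return target - x

def langevin_step(x, target, alpha, step, rng, inject_noise=True):
    # Drift = grad log[q(x) * exp(R(x) / alpha)] with q = N(0, I) (assumed target form).
    drift = reward_grad(x, target) / alpha - x
    x_new = x + step * drift  # deterministic, prior-regularized gradient-ascent update
    if inject_noise:
        # The one extra line: Gaussian noise turns the optimizer into a sampler.
        x_new = x_new + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x_new

def run(n_samples=2000, n_steps=500, alpha=1.0, step=0.01, inject_noise=True, seed=0):
    rng = np.random.default_rng(seed)
    target = np.array([2.0, 0.0])            # hypothetical reward maximizer
    x = rng.standard_normal((n_samples, 2))  # start from the prior N(0, I)
    for _ in range(n_steps):
        x = langevin_step(x, target, alpha, step, rng, inject_noise)
    return x

if __name__ == "__main__":
    sampled = run(inject_noise=True)
    ascended = run(inject_noise=False)
    print("Langevin mean/var:       ", sampled.mean(axis=0), sampled.var(axis=0))
    print("Gradient-ascent mean/var:", ascended.mean(axis=0), ascended.var(axis=0))
```

In this toy setting, the noisy update keeps the samples spread around the compromise between the prior and the reward, while the noise-free update collapses every sample onto a single mode. That is a simplified analogue of the realism loss the abstract attributes to pure gradient ascent.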

Paper Structure

This paper contains 57 sections, 1 theorem, 29 equations, 11 figures, 10 tables, 1 algorithm.

Key Result

Proposition 1

Reward-Guided Langevin Dynamics. Let $q = \mathcal{N}(\mathbf{0}, \mathbf{I})$ denote the prior distribution, $\hat{\mathcal{R}}(\mathbf{x})$ be the pullback of a differentiable reward function, and $\mathbf{w}_t$ denote the standard Wiener process. As $t \rightarrow \infty$, the stationary distribution of the reward-guided Langevin dynamics coincides with the optimal distribution of Eq. \ref{eq:reward_without_over_optimization}.
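For intuition, the standard argument behind a statement of this kind can be sketched as follows. The excerpt does not reproduce the paper's equation, so the form of the target distribution and the temperature $\alpha > 0$ below are assumptions (the usual KL-regularized reward-maximization form). The paper's exact scaling may differ.

```latex
% Assumed target: prior tilted by the exponentiated reward.
p^{*}(\mathbf{x}) \;\propto\; q(\mathbf{x}) \exp\!\left( \tfrac{1}{\alpha} \hat{\mathcal{R}}(\mathbf{x}) \right),
\qquad q = \mathcal{N}(\mathbf{0}, \mathbf{I})

% Score of the target: the Gaussian prior contributes -x.
\nabla \log p^{*}(\mathbf{x}) \;=\; -\mathbf{x} + \tfrac{1}{\alpha} \nabla \hat{\mathcal{R}}(\mathbf{x})

% Langevin SDE driven by that score.
\mathrm{d}\mathbf{x}_t \;=\; \left( \tfrac{1}{\alpha} \nabla \hat{\mathcal{R}}(\mathbf{x}_t) - \mathbf{x}_t \right) \mathrm{d}t + \sqrt{2}\, \mathrm{d}\mathbf{w}_t

% Stationarity check via the Fokker-Planck equation, using p^* \nabla \log p^* = \nabla p^*.
\partial_t p \;=\; -\nabla \cdot \left( p\, \nabla \log p^{*} \right) + \Delta p
\;\;\Longrightarrow\;\;
\partial_t p \,\big|_{p = p^{*}} \;=\; -\Delta p^{*} + \Delta p^{*} \;=\; 0
```

Under these assumptions, $p^{*}$ is a stationary solution of the dynamics. Convergence to it as $t \rightarrow \infty$ additionally requires regularity conditions on $\hat{\mathcal{R}}$, which are not stated here.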

Figures (11)

  • Figure 1: Toy experiment results. Top: latent space samples (blue); bottom: data space samples (red). Gray dots show the original distribution without reward guidance. From left to right: (1) ground truth target distribution from Eq. \ref{eq:reward_without_over_optimization}, (2) results of ReNO \cite{eyring2025reno}, (3) results of ours with uniform time scaling, and (4) results of ours with reward-adaptive time rescaling.
  • Figure 2: Qualitative comparisons on ORIBENCH-Single benchmark (Sec. \ref{subsubsec:ms_coco_single}). Compared to the existing orientation-to-image models \cite{cheng2024learning, liu2023zero}, ORIGEN generates the most realistic images, which also best align with the grounding conditions in the leftmost column.
  • Figure 3: Qualitative comparisons on ORIBENCH-Multi benchmark (Sec. \ref{subsubsec:ms_coco_multi}). Compared to the guided-generation methods \cite{chung2022diffusion, he2023manifold, yu2023freedom, eyring2025reno}, ORIGEN generates the most realistic images, which also best align with the grounding conditions in the leftmost column.
  • Figure 4: Plot of our monitor function $\mathcal{G}(\hat{\mathcal{R}}(\mathbf{x}))$.
  • Figure 5: Qualitative comparisons on the generally extended ORIBENCH-Single benchmark. Compared to other training-free approaches \cite{chung2022diffusion, he2023manifold, yu2023freedom, eyring2025reno}, ORIGEN generates images that best align with the given orientation grounding conditions.
  • ...and 6 more figures

Theorems & Definitions (3)

  • Proposition 1
  • proof
  • proof : Proof of Proposition 1