RelaxFlow: Text-Driven Amodal 3D Generation

Jiayin Zhu; Guoji Fu; Xiaolu Liu; Qiyuan He; Yicong Li; Angela Yao

TL;DR

This work formalizes text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation, and proposes RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism.

Abstract

Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.
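The low-pass reading of relaxation can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a periodic 1D "field" made of a low-frequency structural component plus a high-frequency detail component, and smooths it with a normalized Gaussian kernel. The names `gaussian_kernel`, `low_pass`, `structure`, and `detail` are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Normalized 1D Gaussian kernel (weights sum to 1)."""
    if radius is None:
        radius = int(np.ceil(4 * sigma))  # truncate at about 4 standard deviations
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def low_pass(field, sigma):
    """Circularly convolve a periodic 1D field with a normalized Gaussian."""
    k = gaussian_kernel(sigma)
    radius = len(k) // 2
    # Wrap-around padding so the output has the same length as the input.
    padded = np.concatenate([field[-radius:], field, field[:radius]])
    return np.convolve(padded, k, mode="valid")

# Toy field: assumed decomposition into coarse structure and fine detail.
n = 256
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
structure = np.sin(x)            # low-frequency component
detail = 0.5 * np.sin(40.0 * x)  # high-frequency component
field = structure + detail

smoothed = low_pass(field, sigma=4.0)  # sigma is measured in samples

# Root-mean-square distance to the structural component, before and after.
err_raw = np.sqrt(np.mean((field - structure) ** 2))
err_smooth = np.sqrt(np.mean((smoothed - structure) ** 2))
print(f"RMS error to structure: raw={err_raw:.4f}, smoothed={err_smooth:.4f}")
```

In this toy setting the smoothed field sits much closer to the structural component than the raw field does, because the Gaussian attenuates the high-frequency term far more strongly than the low-frequency one. A very large `sigma` would also start to erase the structure itself.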

Paper Structure (58 sections, 6 theorems, 59 equations, 11 figures, 2 tables)

Introduction
Related Works
ODE Flow Formulation for Dual-Branch 3D Generation
ODE Flow Formulations
Theoretical Justification: Low-Pass Relaxation and Stability
Low-pass relaxation operator.
Spectral analysis of semantic guidance
Stability and Wasserstein Bounds
RelaxFlow Framework
Backbone and Inference Setup
Multi-Prior Consensus
Prompt-to-prior as a visual interface.
Consensus over multiple priors.
Dual-Branch Sampling as ODE Interpolation
Realizing $\tilde{v}$ via logit smoothing.
...and 43 more sections

Key Result

Proposition 1.4

Let $\tilde{\bm{v}}_{\theta} = G_\sigma * \bar{\bm{v}}_{\theta}$ be the velocity field smoothed by a normalized Gaussian kernel $G_\sigma$. Under the assumptions labeled sem-tube and hf-mismatch, for a sufficiently small $\beta$, the semantic estimation error of the blurred field is strictly lower than that of
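For context, the low-pass character of Gaussian smoothing follows from the standard convolution theorem. The identity below is a sketch under two assumptions that the excerpt above does not state: that $G_\sigma$ is the isotropic Gaussian density on $\mathbb{R}^d$ with standard deviation $\sigma$, and that the Fourier transform is taken as $\hat{f}(\omega) = \int f(x)\, e^{-i \omega \cdot x}\, dx$.

$$
\widehat{\tilde{\bm{v}}_{\theta}}(\omega) \;=\; \widehat{G_\sigma}(\omega)\, \widehat{\bar{\bm{v}}_{\theta}}(\omega) \;=\; \exp\!\left(-\tfrac{1}{2}\sigma^{2}\lVert\omega\rVert^{2}\right) \widehat{\bar{\bm{v}}_{\theta}}(\omega)
$$

Under these assumptions, each frequency component of the field is scaled by a factor that decays with $\lVert\omega\rVert$, so low frequencies pass nearly unchanged while high frequencies are suppressed, and a larger $\sigma$ narrows the pass band.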

Figures (11)

Figure 1: The case for multiple plausible amodal 3D interpretations under occlusion. A feedforward image-to-3D model (e.g., SAM3D) collapses to a single overfitted bed-like shape, while RelaxFlow resolves the ambiguity via text-driven amodal generation, allowing users to inject explicit textual intent and steer the generation toward alternative, semantically consistent amodal 3D shapes.
Figure 2: Conceptual illustration of low-pass relaxation. The background depicts a conceptual compatibility landscape over latent states, where high density corresponds to low "energy". Star markers indicate a spurious trapped mode and the intended target mode. Throughout, we use corridor to denote a tube of latent states traced by integral curves of a conditioned velocity field while satisfying a constraint. Smoothing the prior-conditioned guidance thickens the semantic corridor and steers trajectories toward the intended mode while remaining compatible with the observation corridor.
Figure 3: RelaxFlow pipeline overview. An observation-driven branch preserves visible evidence, while the semantic guidance is injected through the intent prompt via multi-prior consensus and low-pass relaxation. Dual branches are fused via velocity blending to resolve occlusion-induced ambiguity.
Figure 4: Qualitative comparisons. Top: Comparisons on AmbiSem-3D examples. Each contains two different intent prompts for the same observation. Our method preserves observation fidelity while enabling prompt-controlled completion. Bottom: The case in ExtremeOcc-3D with an intent prompt "bed". Under extreme occlusion, baselines either overfit the visible region or yield implausible shapes, whereas ours follows the category prior while maintaining visible evidence.
Figure 5: Ablation studies on ExtremeOcc-3D with SAM3D backbone. (a) Component ablation and hyperparameter sensitivity. Removing low-pass relaxation or visibility mask degrades performance; extreme $\rho$ or $\sigma$ values also hurt. "Prior from generation" uses Z-Image-generated priors instead of retrieved ones. (b) Effect of prior count $N$: moderate $N$ improves consensus, but too many priors introduce conflicts.
...and 6 more figures

Theorems & Definitions (12)

Definition 1.1: Wasserstein-$2$ distance
Proposition 1.4: Error reduction via low-pass filtering
proof
Proposition 1.6: Stability of Lipschitz Constant under Gaussian Smoothing
proof
Lemma 1.7: Stability analysis for ODE flows
proof
Theorem 1.8
proof
Theorem 1.9: Wasserstein Distance Bound
...and 2 more
