
Coarse-Guided Visual Generation via Weighted h-Transform Sampling

Yanghao Wang, Ziqi Jiang, Zhen Wang, Long Chen

Abstract

Coarse-guided visual generation, which synthesizes fine visual samples from degraded or low-fidelity coarse references, is essential for various real-world applications. While training-based approaches are effective, they are inherently limited by high training costs and by restricted generalization, since they depend on collected paired data. Accordingly, recent training-free works leverage pretrained diffusion models and incorporate guidance during the sampling process. However, these training-free methods either require a known forward (fine-to-coarse) transformation operator, e.g., bicubic downsampling, or struggle to balance guidance adherence against synthesis quality. To address these challenges, we propose a novel guidance method based on the h-transform, a tool that can constrain stochastic processes (e.g., the sampling process) under desired conditions. Specifically, we modify the transition probability at each sampling timestep by adding a drift function to the original differential equation, which approximately steers the generation toward the ideal fine sample. To address unavoidable approximation errors, we introduce a noise-level-aware schedule that gradually de-weights the drift term as the error increases, ensuring both guidance adherence and high-quality synthesis. Extensive experiments across diverse image and video generation tasks demonstrate the effectiveness and generalization of our method.
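The idea of adding a weighted drift term can be illustrated on a toy problem. The sketch below is not the paper's sampler: it replaces the pretrained diffusion model with plain 1-D Brownian motion, for which the h-transform drift that conditions the path on ending at a target `y` is known in closed form, `(y - x) / (T - t)`. The function name `sample_weighted_bridge`, the two weight functions, and the linear decay schedule are all hypothetical choices made for illustration, and the paper's actual noise-level-aware schedule may differ.

```python
import numpy as np

def sample_weighted_bridge(y, weight_fn, n_paths=2000, n_steps=200, T=1.0, seed=0):
    """Euler-Maruyama simulation of dX = w(t/T) * (y - X) / (T - t) dt + dW, X_0 = 0.

    Toy stand-in (assumption): Brownian motion plays the role of the
    unguided sampling process, and y plays the role of the reference.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.zeros(n_paths)
    for k in range(n_steps):
        t = k * dt
        # Exact h-transform drift for Brownian motion conditioned on X_T = y.
        h_drift = (y - x) / (T - t)
        # Hypothetical weighting: scale the guidance drift by w(t/T).
        w = weight_fn(t / T)
        x = x + w * h_drift * dt + np.sqrt(dt) * rng.standard_normal(n_paths)
    return x

def full_weight(s):
    """Unweighted h-transform: the path is pinned to the reference."""
    return 1.0

def linear_decay_weight(s):
    """Hypothetical schedule: guidance fades linearly as sampling proceeds."""
    return 1.0 - s

if __name__ == "__main__":
    pinned = sample_weighted_bridge(2.0, full_weight)
    relaxed = sample_weighted_bridge(2.0, linear_decay_weight)
    print("full weight:   mean=%.3f std=%.3f" % (pinned.mean(), pinned.std()))
    print("linear decay:  mean=%.3f std=%.3f" % (relaxed.mean(), relaxed.std()))
```

With full weight, every path ends almost exactly at `y`, which in the coarse-guided setting would amount to reproducing the imperfect reference. With the decaying weight, the drift in this toy case simplifies to `(y - x) / T`, so paths are pulled toward `y` but keep some freedom at the end. This is the qualitative trade-off between guidance adherence and sample quality that a de-weighting schedule is meant to control.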

Paper Structure (40 sections, 37 equations, 12 figures, 4 tables, 1 algorithm)


Figures (12)

  • Figure 1: Given a coarse visual sample (left one in each pair) as the guidance, our method generates its corresponding refined result (right one) in a training-free manner.
  • Figure 2: Existing solutions and ours. (a) Training translation networks on paired data, which is costly and does not generalize to different types of coarse samples. (b) Solving inverse problems with a known forward operator, which is not robust. (c) Adding noise to the coarse sample and denoising it, which makes it difficult to balance guidance and quality. (d) Our method leverages the $h$-transform to achieve training-free, operator-free, and stable coarse-guided generation.
  • Figure 3: Overview of Weighted $h$-Transform Sampling. (a) If we have $\textcolor{myred}{\bm{h}_{\bm{x}_0=\bm{y}}}$, the generation result will be the ideal sample. (b) We leverage $\textcolor{mygreen}{\bm{h}_{\bm{x}_0=\widetilde{\bm{y}}}}$ to approximate the intractable $\textcolor{myred}{\bm{h}_{\bm{x}_0=\bm{y}}}$ and derive that the error increases gradually during the sampling process. (c) To mitigate the influence of this error, we decrease the approximation weight and finally generate a high-quality refined sample.
  • Figure 4: Qualitative results of coarse-guided image generation. Compared with training-free SDEdit, our method shows more faithful synthesis across tasks. For fairness, we use the commonly used hyperparameter for each method, SDEdit ($t_0=500$) and ours ($\alpha=5$), and keep all other settings the same.
  • Figure 5: Qualitative comparisons on the subset of DL3DV-10K. Our method shows better appearance alignment to the ground truth (see highlighted blue boxes).
  • ...and 7 more figures