
Can We Change the Stroke Size for Easier Diffusion?

Yunwei Bai, Ying Kiat Tan, Yao Shu, Tsuhan Chen

Abstract

Diffusion models can be challenged in the low signal-to-noise regime, where they have to make pixel-level predictions despite the presence of high noise. The geometric intuition is akin to using the finest stroke for oil painting throughout, which may be ineffective. We therefore study stroke-size control as a controlled intervention that changes the effective roughness of the supervised targets, predictions, and perturbations across timesteps, in an attempt to ease the low signal-to-noise challenge. We analyze the advantages and trade-offs of the intervention both theoretically and empirically. Code will be released.

Paper Structure

This paper contains 88 sections, 7 theorems, 91 equations, 6 figures, and 2 tables.

Key Result

Proposition 5.1

For the nearest-neighbor pooling-plus-upsampling stroke operator $S_k$ (block-average pooling with stride $k$ followed by nearest-neighbor upsampling; in our method $k=k_{\max}$) and any roughness schedule $\{w_t\}_{t=1}^T$ with $w_t < w_{\max} < 1$, the population minimizer of $\mathcal{L}_{\mathrm{MS}}$ is the conditional mean of the stroke-controlled target. Moreover, under $w_t < w_{\max} < 1$ and Lemma 5.2, $A_t$ is invertible, so conditioning on …
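As a concrete illustration, the stroke operator $S_k$ described above can be sketched in NumPy. This is a hypothetical implementation under the stated definition (block-average pooling with stride $k$, then nearest-neighbor upsampling); the function name is ours, and it assumes the spatial dimensions are divisible by $k$:

```python
import numpy as np

def stroke_operator(x: np.ndarray, k: int) -> np.ndarray:
    """Sketch of the stroke operator S_k: block-average pooling with
    stride k followed by nearest-neighbor upsampling back to the input
    resolution. x has shape (H, W) or (H, W, C) with H, W divisible by k."""
    h, w = x.shape[:2]
    assert h % k == 0 and w % k == 0, "spatial dims must be divisible by k"
    # Block-average pooling: average each non-overlapping k-by-k block.
    pooled = x.reshape(h // k, k, w // k, k, *x.shape[2:]).mean(axis=(1, 3))
    # Nearest-neighbor upsampling: repeat each coarse pixel k times per axis.
    return pooled.repeat(k, axis=0).repeat(k, axis=1)
```

Note that $S_k$ is a projection (applying it twice equals applying it once) and it preserves the image mean, which is consistent with reading it as a "coarser stroke" that keeps low-frequency content while discarding within-block detail.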

Figures (6)

  • Figure 1: Samples under matched training and sampling budgets. DDPM (Ho et al., 2020) uses a fixed pixel-scale target at all timesteps, while stroke control uses coarser targets early.
  • Figure 2: Bucketed training losses across timestep regimes ("rough" to "fine"; CIFAR-10). MultiStroke reduces loss consistently across buckets that involve smoothening.
  • Figure 3: MultiStroke achieves gradient norms comparable to DDPM's.
  • Figure 4: Average log-magnitude DFT over 1,000 images for real data and samples from DDPM and MultiStroke at 10 and 20 steps (zero frequency centered). Brighter energy farther from the center corresponds to higher spatial frequencies. Under tight budgets, MultiStroke tends to bring peripheral energy closer to real; interpret together with FID/OCS, since oversmoothing or reduced diversity can also reduce peripheral energy.
  • Figure 5: Qualitative samples on CelebA-HQ at 10, 20, and 100 steps. Under matched budgets, MultiStroke often stabilizes early global structure, but aggressive early stroke control can leave some background regions less detailed. Later fine-detail timesteps can partially refine or visually conceal this bias, but the effect is timestep dependent. Best viewed in color with zoom.
  • ...and 1 more figure
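The Figure 4 diagnostic, the average log-magnitude DFT with the zero frequency centered, can be sketched as follows. This is our own minimal reconstruction of the plotted quantity for grayscale batches, not the paper's released code:

```python
import numpy as np

def avg_log_magnitude_dft(images: np.ndarray) -> np.ndarray:
    """Average log-magnitude 2D DFT over a batch of grayscale images,
    with the zero-frequency component shifted to the center.
    images: array of shape (N, H, W)."""
    # Per-image 2D DFT magnitude over the spatial axes.
    spectra = np.abs(np.fft.fft2(images, axes=(-2, -1)))
    # Log scale; small epsilon avoids log(0) for empty frequency bins.
    log_mag = np.log(spectra + 1e-8)
    # Average over the batch, then move the DC component to the center.
    return np.fft.fftshift(log_mag.mean(axis=0))
```

Under this reading, brighter values far from the center of the returned map indicate more high-spatial-frequency energy, which is how the caption compares real data against DDPM and MultiStroke samples.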

Theorems & Definitions (13)

  • Proposition 5.1: Optimal predictor for MultiStroke
  • Proof
  • Lemma 5.2: A conservative bridge for pooling plus upsampling
  • Proposition 5.3: Detail-subspace variance reduction
  • Proof
  • Proposition 5.4: Detail-energy control with coarse-to-detail forcing
  • Lemma 1.1: Conditional expectation as $L^2$ projection
  • Proof
  • Proof of Proposition 5.1 (Optimal predictor for MultiStroke)
  • Proposition 1.2: Detail-subspace variance reduction
  • ...and 3 more