Preserving Image Properties Through Initializations in Diffusion Models

Jeffrey Zhang, Shao-Yu Chang, Kedan Li, David Forsyth

TL;DR

This work identifies a critical mismatch between training and inference in diffusion models when starting from pure noise, which undermines production-ready image properties such as uniform backgrounds and consistent lighting. It introduces the PCA-K Offset framework, including PCA-K Offset Inference and PCA-K Offset Training (with Mean Offset as a special case), to align initialization distributions across training and inference and to preserve the full image distribution. The approach yields significant qualitative and quantitative improvements on retail garment images, and can be integrated with controllability methods like ControlNet to enhance generation reliability under strict design constraints. Practically, these techniques enable more controllable, on-brand diffusion-based image synthesis applicable to real-world visual merchandising and related domains.

Abstract

Retail photography imposes specific requirements on images. For instance, images may need uniform background colors, consistent model poses, centered products, and consistent lighting. Minor deviations from these standards impact a site's aesthetic appeal, making the images unsuitable for use. We show that Stable Diffusion methods, as currently applied, do not respect these requirements. The usual practice of training the denoiser with a very noisy image and starting inference with a sample of pure noise leads to inconsistent generated images during inference. This inconsistency occurs because it is easy to tell the difference between samples of the training and inference distributions. As a result, a network trained with centered retail product images with uniform backgrounds generates images with erratic backgrounds. The problem is easily fixed by initializing inference with samples from an approximation of noisy images. However, in using such an approximation, the joint distribution of text and noisy image at inference time still slightly differs from that at training time. This discrepancy is corrected by training the network with samples from the approximate noisy image distribution. Extensive experiments on real application data show significant qualitative and quantitative improvements in performance from adopting these procedures. Finally, our procedure can interact well with other control-based methods to further enhance the controllability of diffusion-based methods.
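The claim that training and inference samples are easy to tell apart can be checked numerically. The sketch below is illustrative only and is not taken from the paper: it assumes the "scaled linear" beta schedule commonly used with Stable Diffusion (start 0.00085, end 0.012, 1000 steps), and it uses a hypothetical flat image whose values are all +1 as a stand-in for a uniform white background. Under those assumptions, the noisiest training sample still carries a small but nonzero copy of the clean image, while the inference sample is pure noise.

```python
import numpy as np

def scaled_linear_alpha_bar(beta_start=0.00085, beta_end=0.012, steps=1000):
    """Cumulative signal retention alpha_bar_t for a 'scaled linear' beta schedule.

    The default values are an assumption (a common Stable Diffusion setting),
    not something stated in the paper.
    """
    betas = np.linspace(beta_start ** 0.5, beta_end ** 0.5, steps) ** 2
    return np.cumprod(1.0 - betas)

alpha_bar = scaled_linear_alpha_bar()
signal_scale = float(np.sqrt(alpha_bar[-1]))       # weight on the clean image at the last step
noise_scale = float(np.sqrt(1.0 - alpha_bar[-1]))  # weight on the Gaussian noise

rng = np.random.default_rng(0)
n = 65536
x0 = np.ones(n)  # hypothetical flat image: every value at +1

# Noisiest sample seen in training: forward-noised clean image.
x_T_train = signal_scale * x0 + noise_scale * rng.standard_normal(n)
# Usual inference start: pure Gaussian noise, no image content at all.
x_T_infer = rng.standard_normal(n)

print(f"sqrt(alpha_bar_T)              = {signal_scale:.4f}")
print(f"mean of noised training sample = {x_T_train.mean():.4f}")
print(f"mean of pure-noise sample      = {x_T_infer.mean():.4f}")
```

Because the training sample keeps a residual offset proportional to the image mean, a simple statistic such as the sample mean already separates the two distributions for images with large uniform regions.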

Paper Structure (17 sections, 15 equations, 5 figures, 1 table)


Figures (5)

  • Figure 1: Despite training on images with properties (1)-(5), standard diffusion-based training and inference lead to unexpected results. (a) shows sample sequences from our garment dataset. (b) shows that standard fine-tuning and inference with Stable Diffusion (Rombach et al., 2021) do not generate the same distribution of images, despite being trained on images from (a). The prompts are taken from the training data, where we expect the best results. To show that this is not a training error, in (c) we run a control experiment by changing $x_{start}$ (Eq. \ref{eq:pca_offset_inference}) to the training image shown in (a). The generated images match the training distribution, indicating that initialization information strongly influences results. (F*: "female garment, no person, white background"; M*: "male garment, no person, white background"; F†: "female person wearing garment"; M†: "male person wearing garment")
  • Figure 2: Intermediate outputs for an $S=20$ DDIM training process are visualized to show that the first step of the diffusion process is out of distribution for standard DDIM training + inference. The top two rows show the intermediate outputs when initializing with noise (DDIM inference). The bottom two rows show the intermediate outputs when projecting a gray sweater with PCA-3 Offset Inference. Rows 1 and 3 show the noisy image $x_t$, and rows 2 and 4 show the predicted $\hat{x_0}_t$ at each time step $t$. Row 2 shows that the first predicted $\hat{x_0}$ introduces a dark, non-uniform background that is propagated throughout the process, whereas in row 4 the predicted $\hat{x_0}$ is already close to the desired distribution, making the diffusion process much more stable.
  • Figure 3: We show that applying our PCA-0 and PCA-3 Offset Inference to DDIM Training significantly improves the generation of desired image properties, but the output is strongly biased by the sampled initialization $x_{start}$. This leads to some undesirable artifacts. Rows 1 and 2 show that PCA-0 occasionally generates non-white backgrounds for garments due to faint sleeves in the mean image, violating property (3). In rows 3-6, generated images are strongly influenced by the color and shape of $x_{start}$: "...black cotton flared pants..." are generated as white, "...tailored straight leg trousers..." are generated as shorts, etc. This violates property (1), as the generated images do not respect the text, and further indicates that $x_{start}$ strongly influences the denoiser. Descriptions are freeform text from fashion designers.
  • Figure 4: Using Mean Offset Training and Mean Offset Inference provides better text control because the relationship between initialization and text is preserved during training and inference. We apply two-class mean initialization for garments and models and intentionally swap the means to test the effect of different initializations during inference. (a) shows garment and model results from DDIM Training + Mean Offset Inference that violate various properties. (b) shows that Mean Offset Training + Inference results satisfy all desired properties. Red boxes highlight generation errors in (a), and green boxes show that they are fixed in (b). Red solid borders mark artifacts that should not exist and outputs that do not fully respect the text ((a) fails property (1)). Red dashed borders mark cases where a model is generated instead of the garment specified by the text, and the person is not in the proper pose ((a) fails properties (1) and (4)). Red dotted borders mark non-white backgrounds or cropped garments/models ((a) fails property (3)).
  • Figure 5: We adapt ControlNet (Zhang et al., 2023) to take a garment condition and generate models wearing garments. We display three seeds for the same control to show that vanilla ControlNet (DDIM Training + Inference) consistently produces out-of-distribution results (violating properties (4) and (5)), whereas ControlNet with Mean Offset Training + Inference (Ours) perfectly preserves the desired training distribution.
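
The captions above refer to an initialization $x_{start}$ and to PCA-K Offset Inference, with PCA-0 reducing to the mean image. The following is a minimal sketch of one way such an initialization could be built. It is an illustration under stated assumptions, not the authors' implementation: the function names `fit_pca_k`, `sample_x_start`, and `offset_init` are hypothetical, drawing Gaussian coefficients along the top K principal directions is an assumed sampling rule, the toy data stands in for training images or latents, and the value of `alpha_bar_T` is an assumed terminal signal level.

```python
import numpy as np

def fit_pca_k(images, k):
    """Fit a rank-k PCA model: mean, top-k directions, per-direction std."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    mean = flat.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(flat - mean, full_matrices=False)
    std = s[:k] / np.sqrt(max(len(flat) - 1, 1))
    return mean, vt[:k], std

def sample_x_start(mean, components, std, rng):
    """Draw x_start from the rank-k Gaussian approximation (assumed rule).

    With k = 0 there are no coefficients, so this returns the mean image.
    """
    coeffs = rng.standard_normal(len(std)) * std
    return mean + coeffs @ components

def offset_init(x_start, alpha_bar_T, rng):
    """Forward-noise x_start to the terminal step instead of using pure noise."""
    eps = rng.standard_normal(x_start.shape)
    return np.sqrt(alpha_bar_T) * x_start + np.sqrt(1.0 - alpha_bar_T) * eps

rng = np.random.default_rng(0)
# Toy stand-in for a training set: 32 bright 8x8 RGB arrays (hypothetical data).
train = rng.uniform(0.8, 1.0, size=(32, 8, 8, 3))

mean, comps, std = fit_pca_k(train, k=3)        # PCA-3
x_start = sample_x_start(mean, comps, std, rng)
x_T = offset_init(x_start, alpha_bar_T=0.0047, rng=rng)  # assumed terminal alpha_bar

print(x_T.shape)
```

Setting `k=0` gives the mean-offset special case, where every inference run starts from the same noised mean image. Larger `k` lets the initialization vary in color and coarse shape, which is consistent with the bias toward $x_{start}$ described in Figure 3.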