DiffRetouch: Using Diffusion to Retouch on the Shoulder of Experts

Zheng-Peng Duan, Jiawei Zhang, Zheng Lin, Xin Jin, Dongqing Zou, Chunle Guo, Chongyi Li

TL;DR

DiffRetouch addresses the subjectivity of image retouching by modeling a diverse fine-retouched distribution with diffusion, conditioned on the low-quality input $\mathbf{R}$ and a four-dimensional attribute vector $\mathbf{c} \in [-1,1]^4$. It builds a Stable Diffusion-based retouching framework that outputs a per-pixel affine transformation via an affine bilateral grid $\mathbf{A}$ to mitigate texture distortion, while employing cross-attention to map $\mathbf{c}$ into the network. A reconstruction-based training objective with latent $\boldsymbol{\epsilon}$-prediction and pixel-space fidelity, combined with a contrastive loss $\mathcal{L}_{cl}$, enforces that attribute adjustments produce perceptually aligned outputs; this enables flexible, user-driven styling without requiring extra exemplars. Empirical results on MIT-Adobe FiveK and PPR10K show improved perceptual quality, distributional similarity to expert retouchings, and stronger attribute controllability, with user studies favoring DiffRetouch. The approach supports practical, single-model multi-style retouching and sets the stage for broader attribute-driven image editing in low-level vision tasks, with code release planned.
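
The per-pixel affine step mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The grid layout `(Gh, Gw, Gd, 3, 4)`, the mean-RGB guidance map, the nearest-neighbour lookup, and the function names `slice_affine_grid` and `apply_affine` are all assumptions made for illustration; bilateral-grid methods commonly interpolate between grid cells rather than picking the nearest one.

```python
import numpy as np

def slice_affine_grid(grid, image):
    """Look up one 3x4 affine matrix per pixel (nearest-neighbour, assumed).

    grid:  (Gh, Gw, Gd, 3, 4) affine bilateral grid A (hypothetical layout).
    image: (H, W, 3) input R with values in [0, 1].
    """
    gh, gw, gd = grid.shape[:3]
    h, w, _ = image.shape
    # Spatial cell index for every pixel row and column (position lookup).
    ys = np.arange(h) * gh // h
    xs = np.arange(w) * gw // w
    # Intensity bin from an assumed guidance map: the mean of the RGB channels.
    guide = image.mean(axis=-1)
    zs = np.clip((guide * gd).astype(int), 0, gd - 1)
    # Broadcasting (H,1), (1,W), (H,W) yields a (H, W, 3, 4) array.
    return grid[ys[:, None], xs[None, :], zs]

def apply_affine(affine, image):
    """Compute D = A[:, :3] @ rgb + A[:, 3] independently at every pixel."""
    linear = np.einsum("hwij,hwj->hwi", affine[..., :3], image)
    return linear + affine[..., 3]

# Usage: an identity grid leaves the input unchanged.
grid = np.zeros((4, 4, 8, 3, 4))
grid[..., :3] = np.eye(3)
image = np.random.default_rng(0).random((16, 12, 3))
output = apply_affine(slice_affine_grid(grid, image), image)
```

Because the output is an affine function of the input colors at each pixel, fine texture in $\mathbf{R}$ is carried over to the result rather than regenerated, which is consistent with the stated motivation of mitigating texture distortion.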

Abstract

Image retouching aims to enhance the visual quality of photos. Because users have different aesthetic preferences, the target of retouching is subjective. However, current retouching methods mostly adopt deterministic models, which not only neglect the style diversity in expert-retouched results and tend to learn an average style during training, but also lack sample diversity during inference. In this paper, we propose a diffusion-based method named DiffRetouch. Thanks to the strong distribution-modeling ability of diffusion, our method can capture the complex fine-retouched distribution covering the various visually pleasing styles in the training data. Moreover, four image attributes are made adjustable to provide a user-friendly editing mechanism. By adjusting these attributes within specified ranges, users can customize preferred styles within the learned fine-retouched distribution. Additionally, an affine bilateral grid and a contrastive learning scheme are introduced to handle texture distortion and control insensitivity, respectively. Extensive experiments demonstrate the superior performance of our method in visual appeal and sample diversity. The code will be made available to the community.

Paper Structure (27 sections, 15 equations, 25 figures, 6 tables, 2 algorithms)

This paper contains 27 sections, 15 equations, 25 figures, 6 tables, 2 algorithms.

Figures (25)

  • Figure 1: DiffRetouch supports editing the retouching style by adjusting the condition $\mathbf{c}$, where each coefficient $\mathbf{c}_i$ corresponds to one image attribute. We generate numerous results with $|\mathbf{c}_i|$ randomly sampled in [0,1], (1,2], and (2,3]. The features of these results, extracted by the style encoder [song2021starenhancer], are visualized using t-SNE [van2008visualizing]. Since DiffRetouch is trained with $|\mathbf{c}_i|$ limited to [0,1], results sampled in this range fall within the fine-retouched distribution surrounded by the ground truths; results sampled outside it deviate from this distribution and lie closer to the low-quality images. Users can therefore adjust within [0,1] to obtain preferred styles while the final outputs tend to remain objectively visually pleasing.
  • Figure 2: Pipeline of DiffRetouch, including the sampling process and the supervision used during training. The baseline model is marked in gray. The affine bilateral grid and $\mathcal{L}_{cl}$ are additionally introduced in DiffRetouch to tackle texture distortion and control insensitivity. During training, at each step the denoising model takes the noisy latent $\mathbf{Z}_t$, a resized version of $\mathbf{R}$, and the condition $\mathbf{c}$ on the image attributes as input, then generates $\mathbf{Z}_{t-1}$ and the affine bilateral grid $\mathbf{A}_{t-1}$ simultaneously. After looking up $\mathbf{A}$ based on the position and intensity of each pixel in $\mathbf{R}$, similar to [gharbi2017deep], the output $\mathbf{D}$ is obtained by matrix multiplication between the sliced affine matrices and the pixel colors of $\mathbf{R}$. $\mathcal{L}_{rec}$ (Eq. (\ref{eq:Lrec})) is imposed in both the latent ($\mathbf{Z}$) and pixel ($\mathbf{D}$) spaces, along with $\mathcal{L}_{cl}$ (Eq. (\ref{eq:Lcl})). During inference, at each sampling step $\mathbf{Z}_{t-1}$ is used as the input of the next denoising step. Only at the last step is $\mathbf{A}_{0}$ used to obtain the final output $\mathbf{D}_0$.
  • Figure 3: Examples of texture distortion and control insensitivity. Top row: (a) input image; (b) and (c) results generated without and with the affine bilateral grid. Bottom two rows: (d) and (e) results generated by the model without and with $\mathcal{L}_{cl}$; (f) results retouched by two experts, used as GT. The input condition $\mathbf{c}$ is shown on the left, where the adjusted attributes are contrast and color temperature. With $\mathcal{L}_{cl}$ (Eq. (\ref{eq:Lcl})), the sky region is closer to the expert-retouched results.
  • Figure 4: Framework of the contrastive learning scheme. The regular branch takes the latent $\mathbf{Z}_0$, the noise map $\bm{\epsilon}$, and the condition $\mathbf{c}$ as input to generate the result $\mathbf{D}_t$. Two further branches produce the positive sample $\mathbf{D}^+_t$ with a different noise map $\bm{\epsilon'}$ and the same condition $\mathbf{c}$, and the negative sample $\mathbf{D}^-_t$ with the same $\bm{\epsilon}$ and the opposite condition $\mathbf{c}^-$. For coefficients $\mathbf{c}_i \neq 0$, $\mathcal{L}_{cl}$ (Eq. (\ref{eq:Lcl})) steers the corresponding $\mathbf{s}_i$ closer to $\mathbf{s}^+_i$ and away from $\mathbf{s}^-_i$. In this example, the adjusted attributes are color temperature and brightness.
  • Figure 5: Qualitative comparison on the MIT-Adobe FiveK dataset with subsets retouched by five experts (A/B/C/D/E). Since 3D-LUT [zeng2020learning] and CSRNet [he2020conditional] cannot produce multiple retouching styles, only their results corresponding to Expert-C are displayed. The input condition $\mathbf{c}$ is shown at the top of each DiffRetouch result, and the color histogram is shown at the bottom-left corner of each image. Results generated by DiffRetouch are more similar to the corresponding expert-retouched results, especially in the color histogram.
  • ...and 20 more figures
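
The contrastive scheme described in the Figure 4 caption can be sketched as follows. The exact form of $\mathcal{L}_{cl}$ is not given in this section, so the ratio of distances below is an assumed stand-in rather than the paper's equation. The function name `contrastive_loss`, the treatment of $\mathbf{s}$, $\mathbf{s}^+$, and $\mathbf{s}^-$ as scalar per-attribute scores, and the stabilizer `eps` are likewise hypothetical.

```python
import numpy as np

def contrastive_loss(s, s_pos, s_neg, c, eps=1e-6):
    """Assumed contrastive objective over the adjusted attributes only.

    s, s_pos, s_neg: (4,) per-attribute scores of the regular, positive
                     (same c, different noise) and negative (opposite c)
                     branches; treating them as scalars is an assumption.
    c:               (4,) condition vector; only entries with c_i != 0 count.
    """
    active = c != 0
    if not active.any():
        return 0.0
    d_pos = np.abs(s - s_pos)[active]  # should shrink: pull toward positive
    d_neg = np.abs(s - s_neg)[active]  # should grow: push away from negative
    return float(np.mean(d_pos / (d_neg + eps)))

# Usage: two attributes are adjusted, the other two are ignored.
c = np.array([0.5, 0.0, 0.0, -0.3])
s = np.array([1.0, 2.0, 3.0, 4.0])
loss = contrastive_loss(s, s + 0.1, s + 1.0, c)
```

Masking on $\mathbf{c}_i \neq 0$ mirrors the caption: attributes the user did not adjust contribute nothing, so the loss only rewards sensitivity along the dimensions that were actually changed.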