LCUDiff: Latent Capacity Upgrade Diffusion for Faithful Human Body Restoration

Jue Gong, Zihan Zhou, Jingkai Wang, Shu Li, Libo Liu, Jianliang Lan, Yulun Zhang

TL;DR

LCUDiff tackles the fidelity gap in human body restoration (HBR) by upgrading the latent space of a pretrained latent diffusion model from $4$ to $16$ channels. Channel Splitting Distillation (CSD) keeps the first four anchor channels aligned with the pretrained latent while the additional channels learn high-frequency details. Prior-Preserving Adaptation (PPA) smoothly bridges the mismatch between the frozen $4$-channel UNet and the expanded latent, and a Decoder Router (DeR) selects between decoders on a per-sample basis, improving pixel-level and perceptual fidelity without extra inference cost. The approach is validated on synthetic and real-world datasets, where it reports superior $\text{DISTS}$ and $\text{PSNR}/\text{PSNRY}$ scores and improved no-reference metrics while preserving one-step efficiency. The authors state that the code and model will be released.

Abstract

Existing methods for restoring degraded human-centric images often struggle with insufficient fidelity, particularly in human body restoration (HBR). Recent diffusion-based restoration methods commonly adapt pre-trained text-to-image diffusion models, where the variational autoencoder (VAE) can significantly bottleneck restoration fidelity. We propose LCUDiff, a stable one-step framework that upgrades a pre-trained latent diffusion model from a 4-channel latent space to a 16-channel latent space. For VAE fine-tuning, channel splitting distillation (CSD) is used to keep the first four channels aligned with pre-trained priors while allocating the additional channels to effectively encode high-frequency details. We further design prior-preserving adaptation (PPA) to smoothly bridge the mismatch between 4-channel diffusion backbones and the higher-dimensional 16-channel latent. In addition, we propose a decoder router (DeR) for per-sample decoder routing using restoration-quality score annotations, which improves visual quality across diverse conditions. Experiments on synthetic and real-world datasets show competitive results with higher fidelity and fewer artifacts under mild degradations, while preserving one-step efficiency. The code and model will be available at https://github.com/gobunu/LCUDiff.
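The channel-splitting idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the CSD objective, so the mean-squared-error alignment, the function names `mse` and `csd_anchor_loss`, and the list-of-channels latent layout are all assumptions made for illustration. The sketch only shows the structural point that an alignment term touches the first four channels and leaves the other twelve unconstrained.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat lists."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def csd_anchor_loss(z16, z4, num_anchor=4):
    """Alignment term over the anchor channels only (assumed MSE form).

    z16: channels from the fine-tuned 16-channel encoder, given as a list
         of channels, each a flat list of floats (hypothetical layout).
    z4:  channels from the frozen pretrained 4-channel encoder.
    Channels from index num_anchor onward are deliberately ignored here,
    so only other objectives (for example reconstruction, not shown)
    would shape them.
    """
    if len(z16) < num_anchor or len(z4) != num_anchor:
        raise ValueError("unexpected channel count")
    per_channel = [mse(z16[c], z4[c]) for c in range(num_anchor)]
    return sum(per_channel) / num_anchor


# Toy latents with 2 spatial positions per channel.
teacher = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [-1.0, 1.0]]
student = [list(ch) for ch in teacher] + [[9.0, -9.0] for _ in range(12)]

aligned = csd_anchor_loss(student, teacher)  # extra channels have no effect
student[0][0] += 2.0                         # perturb one anchor value
shifted = csd_anchor_loss(student, teacher)
```

In this toy setup the loss is zero while the anchor channels match the teacher, whatever the twelve extra channels contain, and it becomes positive as soon as an anchor value drifts.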

Paper Structure

This paper contains 14 sections, 8 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Trade-off between pixel-level and perceptual metrics on PERSONA-Val. Each point denotes a method, with DISTS on the x-axis and PSNR on the y-axis. The red star indicates our method. Inference time is measured on 512$\times$512 inputs using an NVIDIA RTX A6000. Most methods lie along a diagonal trend, suggesting that improving PSNR often comes at the cost of worse perceptual quality, or vice versa. Our method is located in the upper-left region, suggesting a better PSNR–DISTS balance.
  • Figure 2: Visual comparison of VAEs. Our VAE preserves subtle structures and reduces distortions compared with the SD VAE.
  • Figure 3: Model structure and training pipeline of our LCUDiff. Stage 1: We fine-tune a 16-channel VAE with channel splitting distillation (CSD). The first four anchor channels are aligned with the pretrained 4-channel latent space to preserve prior stability, while the remaining channels are optimized to encode additional high-frequency details. Stage 2: We train a one-step diffusion restoration model on the upgraded 16-channel latent space with prior-preserving adaptation (PPA). PPA builds two parallel input paths, an anchor-prior branch and a new 16-channel branch, and uses a fusion schedule to smoothly transition from the frozen prior pathway to the higher-dimensional latent pathway, stabilizing training without increasing inference overhead.
  • Figure 4: Training of decoder router (DeR). We build a preference dataset by decoding each restored latent with both the pretrained $\mathcal{D}_{4ch}$ and the fine-tuned $\mathcal{D}_{16ch}$. DeR takes the concatenation of $z_L$ and $\hat{z}_H$ as input and is trained with a soft BCE loss to predict the decoder preference. Here $\sigma(\cdot)$ denotes the sigmoid function, and the diffusion backbone and both decoders are kept frozen.
  • Figure 5: Visual comparison on the synthetic PERSONA-Val. Please zoom in for a better view.
  • ...and 2 more figures
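
The captions of Figures 3 and 4 mention a fusion schedule for PPA and a soft BCE loss for DeR without giving formulas. The sketch below shows one plausible reading of each, under stated assumptions: a linear ramp for the fusion weight, a convex blend of the two input paths, and a soft target obtained by passing the difference of two quality scores through a sigmoid. The function names, the `temperature` parameter, and the ramp shape are hypothetical and are not taken from the paper.

```python
import math


def sigmoid(x):
    """Logistic function, the sigma(.) named in the Figure 4 caption."""
    return 1.0 / (1.0 + math.exp(-x))


def fusion_weight(step, ramp_steps):
    """Assumed linear ramp from 0 (prior path only) to 1 (new path only)."""
    if ramp_steps <= 0:
        raise ValueError("ramp_steps must be positive")
    return min(max(step / ramp_steps, 0.0), 1.0)


def fuse(anchor_feat, new_feat, alpha):
    """Convex blend of anchor-prior and 16-channel branch features."""
    return [(1.0 - alpha) * a + alpha * n
            for a, n in zip(anchor_feat, new_feat)]


def soft_label(score_16ch, score_4ch, temperature=1.0):
    """Assumed soft preference for the 16-channel decoder, in (0, 1)."""
    return sigmoid((score_16ch - score_4ch) / temperature)


def soft_bce(logit, target, eps=1e-12):
    """Binary cross-entropy against a soft target in [0, 1]."""
    p = min(max(sigmoid(logit), eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Early in training the frozen prior path dominates; later the new path does.
start = fuse([1.0, 2.0], [3.0, 4.0], fusion_weight(0, 100))
end = fuse([1.0, 2.0], [3.0, 4.0], fusion_weight(200, 100))

# Equal quality scores give an undecided target of 0.5.
target = soft_label(0.7, 0.7)
loss_matched = soft_bce(0.0, target)  # router predicts 0.5
loss_off = soft_bce(2.0, target)      # router is overconfident
```

With this kind of schedule, only the 16-channel path contributes once the weight reaches 1, which is one way to read the caption's remark that PPA adds no inference overhead. The soft BCE is smallest when the router's predicted probability equals the soft target, so near-ties between the two decoders do not push the router toward a confident choice.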