ArtiFade: Learning to Generate High-quality Subject from Blemished Images

Shuya Yang, Shaozhe Hao, Yukang Cao, Kwan-Yee K. Wong

TL;DR

ArtiFade tackles blemished subject-driven generation by aligning unblemished and blemished training data and fine-tuning only selective diffusion-model components along with an artifact-free textual embedding. The method constructs paired data of unblemished and blemished images, applies Textual Inversion to obtain blemished embeddings, and optimizes a dedicated artifact-free embedding while fine-tuning cross-attention keys and values to reconstruct clean subject images. A bespoke evaluation benchmark and comprehensive experiments demonstrate superior artifact removal and subject fidelity in both in-distribution and out-of-distribution scenarios, including compatibility with DreamBooth via LoRA. The work offers a practical, generalizable solution for real-world image collections containing artifacts such as watermarks, stickers, or adversarial noise, enabling robust subject-driven generation in diverse settings.
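
To make "fine-tuning only selective diffusion-model components" concrete, here is a minimal sketch of how cross-attention key and value weights might be picked out by parameter name. It is not the authors' code. It assumes diffusers-style parameter names, where `attn2` marks a cross-attention layer and `to_k` / `to_v` are its key and value projections; the helper names `is_cross_attention_kv` and `split_parameters` and the example name list are hypothetical.

```python
# Hypothetical sketch: decide which parameters to fine-tune, by name only.
# Assumption: diffusers-style naming, where "attn2" is cross-attention and
# "to_k" / "to_v" are the key / value projections. Not the authors' code.


def is_cross_attention_kv(param_name: str) -> bool:
    """Return True for key/value projection weights of a cross-attention layer."""
    parts = param_name.split(".")
    return "attn2" in parts and ("to_k" in parts or "to_v" in parts)


def split_parameters(param_names):
    """Partition parameter names into (trainable, frozen), preserving order."""
    trainable, frozen = [], []
    for name in param_names:
        if is_cross_attention_kv(name):
            trainable.append(name)
        else:
            frozen.append(name)
    return trainable, frozen


# Illustrative names only; a real model would supply these via named_parameters().
example_names = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_k.weight",  # self-attention key
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_q.weight",  # cross-attention query
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_k.weight",  # cross-attention key
    "down_blocks.0.attentions.0.transformer_blocks.0.attn2.to_v.weight",  # cross-attention value
    "down_blocks.0.resnets.0.conv1.weight",                               # convolution
]

trainable, frozen = split_parameters(example_names)
print("trainable:", trainable)
print("frozen:", len(frozen), "parameters")
```

In a real training script, the parameters named in `trainable` would be left with gradients enabled, together with the artifact-free embedding, and everything in `frozen` would be excluded from the optimizer.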

Abstract

Subject-driven text-to-image generation has witnessed remarkable advancements in its ability to learn and capture characteristics of a subject using only a limited number of images. However, existing methods commonly rely on high-quality images for training and may struggle to generate reasonable images when the input images are blemished by artifacts. This is primarily attributed to the inadequate capability of current techniques in distinguishing subject-related features from disruptive artifacts. In this paper, we introduce ArtiFade to tackle this issue and successfully generate high-quality artifact-free images from blemished datasets. Specifically, ArtiFade exploits fine-tuning of a pre-trained text-to-image model, aiming to remove artifacts. The elimination of artifacts is achieved by utilizing a specialized dataset that encompasses both unblemished images and their corresponding blemished counterparts during fine-tuning. ArtiFade also ensures the preservation of the original generative capabilities inherent within the diffusion model, thereby enhancing the overall performance of subject-driven methods in generating high-quality and artifact-free images. We further devise evaluation benchmarks tailored for this task. Through extensive qualitative and quantitative experiments, we demonstrate the generalizability of ArtiFade in effective artifact removal under both in-distribution and out-of-distribution scenarios.
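
The paired fine-tuning described above can be sketched as a standard latent-diffusion reconstruction objective. The notation is an assumption for illustration and may differ from the paper's: $x$ is an unblemished image from the paired dataset, $\mathcal{E}$ the latent encoder, $z_t$ the noised latent at timestep $t$, $v^{\beta}$ the embedding obtained by Textual Inversion on the blemished counterparts of $x$, $v_{\mathrm{af}}$ the artifact-free embedding, and $c(\cdot)$ the text conditioning built from a prompt containing both embeddings.

$$
\mathcal{L} = \mathbb{E}_{z = \mathcal{E}(x),\; \epsilon \sim \mathcal{N}(0, I),\; t}
\left[ \left\lVert \epsilon - \epsilon_{\theta}\!\left(z_t,\, t,\, c(v_{\mathrm{af}}, v^{\beta})\right) \right\rVert_2^2 \right]
$$

Under this reading, the target is the clean image while the conditioning comes from the blemished embedding, so minimizing $\mathcal{L}$ over the fine-tuned weights and $v_{\mathrm{af}}$ pushes the model to reconstruct the subject without its artifacts.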

Paper Structure (47 sections, 6 equations, 22 figures, 5 tables)

Figures (22)

  • Figure 1: Blemished subject-driven generation with our ArtiFade and vanilla subject-driven methods. We display images generated using ArtiFade and Textual Inversion on watermark artifacts on the left, and ArtiFade and DreamBooth on adversarial noise artifacts [van2023anti] on the right. In contrast to the poor performance of Textual Inversion and DreamBooth, which are negatively affected by the visible or invisible artifacts, ArtiFade reproduces the subject with much better fidelity and high-quality generation.
  • Figure 2: Overview of ArtiFade. On the left, we present Artifact Rectification Training, which involves an iterative process of calculating reconstruction loss between an unblemished image and the reconstruction of its blemished embedding. The right-hand side is the inference stage that tests ArtiFade on unseen blemished images. To avoid ambiguity, we (1) simplify the training of Textual Inversion into an input-output form, and (2) use "fine-tuning" and "inference" to respectively refer to the fine-tuning stage of ArtiFade and the use of ArtiFade for subject-driven generation.
  • Figure 3: Examples of training dataset $\mathcal{D}$ that contains both unblemished images and blemished counterparts.
  • Figure 4: Qualitative Comparison - ID. Unlike Textual Inversion, which struggles to produce reasonable generations from blemished inputs, our method consistently learns the distinguishing features of the given subject and achieves high-quality generation without distortion.
  • Figure 5: Qualitative Comparison - OOD. Our method generalizes to out-of-distribution artifacts that are unseen during fine-tuning, demonstrating much better performance than Textual Inversion. Best viewed in PDF with zoom.
  • ...and 17 more figures