SOEDiff: Efficient Distillation for Small Object Editing

Yiming Wu, Qihe Pan, Zhen Zhao, Zicheng Wang, Sifan Long, Ronghua Liang

TL;DR

SOEDiff addresses the challenging task of editing very small objects within images by training a diffusion-based editor that concentrates learning on tiny regions. It introduces SO-LoRA to efficiently adapt pre-trained diffusion models and Cross-Scale Score Distillation to leverage high-resolution teacher predictions, combining these with VAE fine-tuning in a teacher-student framework. The method yields substantial gains in CLIP-Score and FID on OpenImage and MSCOCO small-object benchmarks, outperforming strong baselines, and demonstrates robustness via ablations and full-size evaluations. This approach enables accurate, text-guided modifications at a fine-grained spatial scale with improved fidelity and alignment, offering practical applications in precise image editing tasks.

Abstract

In this paper, we delve into a new task known as small object editing (SOE), which focuses on text-based image inpainting within a constrained, small-sized area. Despite the remarkable success achieved by current image inpainting approaches, applying them to the SOE task generally results in failure cases such as Object Missing, Text-Image Mismatch, and Distortion. These failures stem from the limited use of small-sized objects in training datasets and the downsampling operations employed by U-Net models, both of which hinder accurate generation. To overcome these challenges, we introduce a novel training-based approach, SOEDiff, aimed at enhancing the capability of baseline models like Stable Diffusion in editing small-sized objects while minimizing training costs. Specifically, our method involves two key components: SO-LoRA, which efficiently fine-tunes low-rank matrices, and a Cross-Scale Score Distillation loss, which leverages high-resolution predictions from the pre-trained teacher diffusion model. Our method shows significant improvements on the test datasets collected from MSCOCO and OpenImage, validating its effectiveness in small object editing. In particular, when comparing SOEDiff with the SD-I model on the OpenImage-f dataset, we observe a 0.99 improvement in CLIP-Score and a reduction of 2.87 in FID.

Paper Structure

This paper contains 15 sections, 8 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Examples of small object editing. The first column shows input images along with the masked small areas highlighted within red bounding boxes. The second column shows results generated by SD-I (Rombach et al., 2022). The third column shows results produced by SD-XL (Podell et al., 2023). The fourth column shows results generated by our proposed SOEDiff. "brown, red, orange, white, and blue" are the colors used for editing objects (e.g., dog, goose, stop sign). For a better view, please visit our project page https://soediff.github.io.
  • Figure 2: (a) Challenges in editing small objects. Images are presented along with their corresponding masks, edited images, and enlarged areas in three separate columns. Three major challenges are shown: Object Missing, where the model fails to generate the object described in the text; T2I Mismatch, where the generated content disagrees with the text description, particularly in attributes like color or shape; and Distortion, where the generated object appears distorted, e.g., essential features like the cat's face are missing in this image. The images are generated by SD-I. (b) Illustration of a cross-attention map. For an input image of size $512 \times 512$ with a masked area of $64 \times 64$, the corresponding effective area shrinks to $1 \times 1$ in the mid-block. Such a small masked area is challenging because it may lack the semantic information needed to generate the associated objects. Zoom in for a better view.
  • Figure 3: Overview of our proposed SOEDiff. The student diffusion model receives the image $x$ with a smaller mask, the mask $m$, and the text prompt $c$ as input, while the teacher diffusion model receives the cropped image $x'$ with a larger-sized mask and the mask $m'$. Three objectives are optimized: (a) denoising loss: the student model predicts the noise $\epsilon_\theta$ to match the noise $\epsilon$ added in the forward diffusion process. (b) distillation loss: the student model is trained to generate the same content within the mask region as the teacher model. (c) reconstruction loss: the VAE model is trained to reduce information loss in small target regions.
  • Figure 4: Qualitative comparison of different components: the top row displays masked images and their corresponding text prompts. The second row shows edited images generated by SD-I. In the third and fourth rows, results from SD+SO-LoRA and SOEDiff are presented.
  • Figure 5: Extended application with our proposed SOEDiff. The first row shows the original images, the second row displays the results of object erasing, and rows three to six depict the results of object replacement.
  • ...and 1 more figure
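
The arithmetic behind the Figure 2(b) caption can be checked with a short sketch. It assumes the downsampling factors commonly associated with Stable Diffusion v1: a VAE that reduces each side by 8, followed by three stride-2 stages in the U-Net before the mid-block (another factor of 8). These factors, and the helper name `midblock_side`, are illustrative assumptions and are not taken from the paper.

```python
def midblock_side(pixels: int, vae_factor: int = 8, unet_factor: int = 8) -> float:
    """Side length, in mid-block feature cells, covered by a square region.

    vae_factor and unet_factor are assumed values (typical for Stable
    Diffusion v1), not numbers confirmed by the source text.
    """
    return pixels / (vae_factor * unet_factor)


if __name__ == "__main__":
    # A 512 x 512 image and a 64 x 64 mask, as in the Figure 2(b) caption.
    print("image side at mid-block:", midblock_side(512))
    print("mask side at mid-block:", midblock_side(64))
    # A mask smaller than 64 px covers less than one mid-block cell.
    print("32 px mask at mid-block:", midblock_side(32))
```

Under these assumptions the 64-pixel mask maps to a single mid-block cell, which matches the $1 \times 1$ effective area stated in the caption; any smaller mask occupies only a fraction of one cell.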