SOEDiff: Efficient Distillation for Small Object Editing
Yiming Wu, Qihe Pan, Zhen Zhao, Zicheng Wang, Sifan Long, Ronghua Liang
TL;DR
SOEDiff addresses the challenging task of editing very small objects within images by training a diffusion-based editor that concentrates learning on tiny regions. It introduces SO-LoRA to efficiently adapt pre-trained diffusion models and Cross-Scale Score Distillation to leverage high-resolution teacher predictions, combining these with VAE fine-tuning in a teacher-student framework. The method yields substantial gains in CLIP-Score and FID on OpenImage and MSCOCO small-object benchmarks, outperforming strong baselines, and demonstrates robustness via ablations and full-size evaluations. This approach enables accurate, text-guided modifications at a fine-grained spatial scale with improved fidelity and alignment, offering practical applications in precise image editing tasks.
Abstract
In this paper, we study a new task, small object editing (SOE), which focuses on text-based image inpainting within a constrained, small-sized area. Despite the remarkable success achieved by current image inpainting approaches, applying them to the SOE task generally results in failure cases such as Object Missing, Text-Image Mismatch, and Distortion. These failures stem from the limited use of small-sized objects in training datasets and from the downsampling operations employed by U-Net models, both of which hinder accurate generation. To overcome these challenges, we introduce a novel training-based approach, SOEDiff, aimed at enhancing the capability of baseline models like StableDiffusion in editing small-sized objects while minimizing training costs. Specifically, our method involves two key components: SO-LoRA, which efficiently fine-tunes low-rank matrices, and a Cross-Scale Score Distillation loss, which leverages high-resolution predictions from the pre-trained teacher diffusion model. Our method yields significant improvements on the test datasets collected from MSCOCO and OpenImage, validating its effectiveness in small object editing. In particular, when comparing SOEDiff with the SD-I model on the OpenImage-f dataset, we observe a 0.99 improvement in CLIP-Score and a reduction of 2.87 in FID.
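To make the phrase "fine-tunes low-rank matrices" concrete, the following is a minimal sketch of the generic low-rank adaptation idea, not the SO-LoRA implementation itself, whose details are not given in this abstract. It assumes the standard LoRA formulation in which a frozen weight W is augmented by a scaled product of two small trainable matrices, W + (alpha / r) * B A. The names `lora_weight`, `alpha`, and the layer sizes are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha):
    """Return W + (alpha / r) * (B @ A), the effective weight of an adapted layer.

    w: frozen pre-trained weight, shape d_out x d_in
    a: trainable matrix, shape r x d_in
    b: trainable matrix, shape d_out x r
    """
    r = len(a)              # rank is the number of rows in A
    delta = matmul(b, a)    # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical sizes chosen only for illustration.
d_out, d_in, rank = 8, 8, 2
rng = random.Random(0)

w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero-initialized

w_eff = lora_weight(w, a, b, alpha=4.0)

full_params = d_out * d_in            # parameters touched by full fine-tuning
lora_params = rank * (d_in + d_out)   # parameters trained by the low-rank update
```

Because B starts at zero in this sketch, the adapted layer initially reproduces the pre-trained weight exactly, and only `rank * (d_in + d_out)` values are trained instead of `d_out * d_in`. The saving is modest at these toy sizes and grows with layer width, which is consistent with the abstract's stated aim of minimizing training costs.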
