Customize Your Own Paired Data via Few-shot Way

Jinshu Chen; Bingchuan Li; Miao Hua; Panpan Xu; Qian He

TL;DR

This work tackles customized image editing from minimal paired data by introducing a directional few-shot learning paradigm that expands the learnable space through intra-batch transformations. The diffusion-based editing pipeline has two submodels: $M_x$ learns explicit spatial and color transforms $F$ and $C$, while $M_y$ takes these conditions together with a source image $x_j$ and generates the edited image $y_j$ via diffusion, guided by cross-attention to $[F,C]$. Technical improvements include a trainable VAE, adaptive noise, skip connections, ViT backbones, and a frequency-domain loss, all of which contribute to high-fidelity edits. Experiments show strong data efficiency: the method achieves results competitive with or better than baselines while using as little as 1–10% of the data, and it handles both commonly known and user-defined editing targets across multiple resolutions, with robust qualitative and quantitative performance.

Abstract

Existing solutions to image editing tasks suffer from several issues. Some supervised methods achieve remarkably satisfying results but require huge amounts of paired training data, which greatly limits their use. Other, unsupervised methods take full advantage of large-scale pre-trained priors and are therefore strictly restricted to the domains on which those priors were trained, behaving badly in out-of-distribution cases. The task we focus on is how to enable users to customize their desired effects from only a few image pairs. In our proposed framework, we introduce a novel few-shot learning mechanism based on the directional transformations among samples, which expands the learnable space exponentially. Adopting a diffusion model pipeline, we redesign the condition-calculating modules of our model and apply several technical improvements. Experimental results demonstrate the capabilities of our method in various cases.
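
As a rough illustration of what training on directed transformations among samples could look like at the data level, the sketch below enumerates every ordered pair of distinct samples in a few-shot set. It is only an assumption-laden sketch: the function name `build_directed_tuples`, the dictionary keys, and the choice to pair samples by ordered indices are hypothetical and are not taken from the paper. It also only counts ordered pairs (N(N-1) for N samples); how the paper arrives at its approximately exponential expansion is not described in this summary.

```python
from itertools import permutations


def build_directed_tuples(pairs):
    """Enumerate directed (i -> j) training tuples from a few-shot paired set.

    `pairs` is a list of (x, y) source/target items (images, paths, tensors...).
    Hypothetical helper: the tuple layout is an assumption for illustration only.
    """
    tuples = []
    # permutations(range(n), 2) yields every ordered index pair (i, j) with i != j.
    for i, j in permutations(range(len(pairs)), 2):
        x_i, y_i = pairs[i]
        x_j, y_j = pairs[j]
        tuples.append({
            "i": i, "j": j,
            "x_i": x_i, "y_i": y_i,  # sample the transformation starts from
            "x_j": x_j,              # source image to be edited
            "y_j": y_j,              # edited image used as the training target
        })
    return tuples
```

Under these assumptions, five user-provided pairs yield 20 directed tuples instead of 5 independent examples, which conveys the basic idea of enlarging the training signal without collecting more data.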

Paper Structure (16 sections, 3 equations, 7 figures, 2 tables)

Introduction
Related Work
Generative models
Image editing
Method
Expansion methods for few-shot datasets
Approach
Technical details and improvements
Experiments
Generated results
Comparison
Ablation study
Effects of the data pairing mechanism
Effects of the condition
Ablation study of the technical tricks
...and 1 more sections

Figures (7)

Figure 1: Given only a few source-target image pairs, users can efficiently customize their own image editing models through our framework. Our proposed method captures the desired effects precisely without affecting irrelevant attributes. Our model can handle various editing cases, whether the editing targets are commonly known concepts or newly defined by users.
Figure 2: Instead of training the model on the paired samples themselves, we train the model on the directed transformations among samples from the same domain. In the figure we highlight the training objects in green to show the differences between our method and the existing one. In this way we expand the learnable space approximately exponentially.
Figure 3: All modules of our framework are trained jointly and end-to-end. During training, all samples shown in the illustration are selected from the whole training dataset, while at inference time $\mathbf x_j$ is provided by users and is transformed to obtain $\mathbf y_j$.
Figure 4: Essentially, our model is trained to find a manifold on which the projections of $\mathit f_x$ and $\mathit f_y$ are equal. Horizontal flip is shown on the upper left as a simple example, and normal cases are shown on the lower right.
Figure 5: Here we show more experimental results of our method. The editing target and the number of training pairs provided are given at the top of each result unit, and the corresponding generated images are displayed below. More visual results are shown in our appendix.
...and 2 more figures
