Customize Your Own Paired Data via Few-shot Way

Jinshu Chen, Bingchuan Li, Miao Hua, Panpan Xu, Qian He

TL;DR

This work tackles customized image editing with minimal paired data by introducing a directional few-shot learning paradigm that expands the learnable space through intra-batch transformations. The diffusion-based editing pipeline has two submodels: $M_x$ learns explicit spatial and color transforms $F$ and $C$, while $M_y$ takes these conditions together with $x_j$ and generates $y_j$ via diffusion, guided by cross-attention to $[F,C]$. Key improvements include a trainable VAE, adaptive noise, skip connections, ViT backbones, and a frequency-domain loss, all of which contribute to high-fidelity edits. Experiments show strong data efficiency: results are competitive with or better than baselines while using as little as 1–10% of the data, and the method handles both known and user-defined editing targets across multiple resolutions, with robust qualitative and quantitative performance.
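To make "intra-batch transformations" concrete, here is a minimal sketch that lists every directed (ordered) pair of samples in a batch. The function name `directed_pairs` and the sample identifiers are hypothetical and do not come from the paper. The sketch only counts ordered pairs, which grow as n(n-1); it does not attempt to reproduce the paper's own estimate of how far the learnable space expands.

```python
from itertools import permutations


def directed_pairs(batch):
    """Return every ordered index pair (i, j) with i != j for a batch.

    Each pair stands for one directed transformation "sample i -> sample j".
    This is an illustrative assumption, not the authors' sampling code.
    """
    return list(permutations(range(len(batch)), 2))


# Hypothetical identifiers for four samples from the same domain.
batch = ["x0", "x1", "x2", "x3"]
pairs = directed_pairs(batch)

# Both directions are kept: (0, 1) and (1, 0) are distinct training objects.
print(len(pairs), pairs[:3])
```

With only four samples this already yields twelve directed pairs, which is one way to see why training on transformations among samples gives more supervision than training on the pairs alone.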

Abstract

Existing solutions to image editing tasks suffer from several issues. Some supervised methods achieve remarkably satisfying results but require huge amounts of paired training data, which greatly limits their applicability. Other, unsupervised methods take full advantage of large-scale pre-trained priors, and are therefore strictly restricted to the domains on which the priors were trained and behave badly in out-of-distribution cases. The task we focus on is how to enable users to customize their desired effects with only a few image pairs. In our proposed framework, a novel few-shot learning mechanism based on the directional transformations among samples is introduced, which expands the learnable space exponentially. Adopting a diffusion model pipeline, we redesign the condition-calculating modules in our model and apply several technical improvements. Experimental results demonstrate the capabilities of our method in various cases.

Paper Structure (16 sections, 3 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Given only a few source-target image pairs, users can efficiently customize their own image editing models through our framework. Our proposed method captures the desired effects precisely without affecting irrelevant attributes. Our model can handle various editing cases, whether the editing targets are commonly known concepts or newly defined by users.
  • Figure 2: Instead of training the model on the paired samples themselves, we train the model on the directed transformations among samples from the same domain. In the figure, we highlight the training objects in green to show the differences between our method and the existing method. In this way, we expand the learnable space approximately exponentially.
  • Figure 3: All modules of our framework are trained jointly and end-to-end. During training, all samples mentioned in the illustration are selected from the whole training dataset, while at inference time $\mathbf x_j$ is provided by users and is transferred to obtain $\mathbf y_j$.
  • Figure 4: Essentially, our model is trained to find a manifold on which the projections of $\mathit{f}_x$ and $\mathit{f}_y$ are equal. Horizontal flip is shown on the upper left as a simple example, and general cases are shown on the lower right.
  • Figure 5: Here we show more experimental results of our method. The editing target and the number of given training pairs are recorded at the top of each result unit, and the corresponding generated images are displayed below. More visual results are shown in our appendix.
  • ...and 2 more figures
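
The horizontal-flip example mentioned for Figure 4 can be illustrated with a toy calculation. This is a sketch under strong assumptions that the captions do not confirm: $f_x$ and $f_y$ are taken to be element-wise differences between two samples of the same domain, images are reduced to four-pixel rows, and the "projection" is a hand-picked symmetrization. All function names are hypothetical. The point is only that two different transformations can coincide after a suitable projection.

```python
def flip(v):
    """Horizontally flip a 1-D row of pixel values."""
    return v[::-1]


def diff(a, b):
    """Element-wise b - a, standing in for the directed transformation a -> b."""
    return [q - p for p, q in zip(a, b)]


def project(v):
    """Symmetrize v by adding its mirror image (assumed projection)."""
    return [p + q for p, q in zip(v, flip(v))]


# Two source samples, and their targets under the edit "horizontal flip".
x_i, x_j = [1, 2, 3, 4], [4, 0, 2, 5]
y_i, y_j = flip(x_i), flip(x_j)

f_x = diff(x_i, x_j)  # transformation inside the source domain
f_y = diff(y_i, y_j)  # transformation inside the target domain

print(f_x, f_y)                      # the raw transformations differ
print(project(f_x), project(f_y))    # their projections coincide
```

For a learned edit there is no such closed-form projection, which is presumably why the caption describes the manifold as something the model is trained to find.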