
DiffMorph: Text-less Image Morphing with Diffusion Models

Shounak Chatterjee

TL;DR

DiffMorph tackles artist-driven image morphing without textual prompts by conditioning an initial image on user sketches. It introduces a sketch-to-image pathway (ConditionFlow) that removes certain connections from ControlNet to emphasize sketch constraints, and it fine-tunes a pre-trained diffusion denoiser with a controlled reconstruction objective to merge multiple concepts. The key contributions are (1) a precise sketch-to-image generator, (2) a multi-concept fine-tuning strategy that needs only a single image per concept and uses area-coverage regularization, and (3) an evaluation showing efficient inference times and quality competitive with prompt-based customization methods. The work lets artists generate high-fidelity morphs by combining concepts through sketches alone, reducing dependence on prompt engineering and expanding creative control.

Abstract

Text-conditioned image generation is a prevalent use of AI image synthesis, yet giving an artist intuitive control over the output remains challenging. Current methods require multiple images and a textual prompt for each object in order to specify it as a concept when generating a single customized image. In contrast, our work, DiffMorph, introduces a novel approach that synthesizes images mixing concepts without any textual prompts. It integrates a sketch-to-image module to incorporate user sketches as input: DiffMorph takes an initial image together with artist-drawn conditioning sketches and generates a morphed image. We employ a pre-trained text-to-image diffusion model and fine-tune it to reconstruct each image faithfully. We seamlessly merge the images and the concepts from the sketches into a cohesive composition. We demonstrate the image generation capability of our work through our results and a comparison with prompt-based image generation.

Paper Structure (12 sections, 7 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Our method, DiffMorph. Users can add a sketch of any object (i.e., a concept) on top of an image to generate a new image that incorporates the sketch. Here we use a panda plushie as the primary concept and add secondary concepts as sketches to transform the image.
  • Figure 2: Overview of DiffMorph. The system takes an image and a sketch as input and determines their respective classes using a CLIP classifier (radford2021learning). Our ConditionFlow model converts the sketch into an image. We then fine-tune the Stable Diffusion model (Rombach_2022_CVPR) on the images to generate a combined image.
  • Figure 3: Comparison of the ControlNet and ConditionFlow architectures. The connections marked in red in 1.(a) are removed in 1.(b), which is our proposed modification.
  • Figure 4: We optimize the Stable Diffusion model (Rombach_2022_CVPR) with the determined classes to reconstruct the input images. From the classes we derive a relation between them, which is used to generate the output.
  • Figure 5: Comparison of the settings we tested for sketch-to-image generation. The ControlNet and ConditionFlow columns show the outputs of the baseline model and our modified model, respectively. We also tested two other configurations, explained in Section \ref{sec: condflow}, whose outputs are listed here.
  • ...and 3 more figures
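
The class-determination step in the Figure 2 overview can be illustrated with a minimal sketch of CLIP-style zero-shot classification: an image (or sketch) embedding is compared with one text embedding per candidate class by cosine similarity, and a softmax over the scaled similarities gives class probabilities. This is not the paper's implementation. The three-dimensional embeddings, the class names other than the panda, the temperature value, and the function names are all assumptions made for illustration; a real system would obtain the embeddings from a pre-trained CLIP model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(query_emb, class_embs, temperature=0.01):
    """Return (best_label, probabilities) for one query embedding.

    class_embs maps a class label to its (hypothetical) text embedding.
    temperature is an assumed scaling factor applied to the similarities.
    """
    labels = list(class_embs)
    logits = [cosine(query_emb, class_embs[l]) / temperature for l in labels]
    # Numerically stable softmax over the scaled similarities.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {l: e / total for l, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Toy embeddings standing in for real CLIP outputs (assumed values).
class_embs = {
    "panda": [0.9, 0.1, 0.0],
    "hat": [0.0, 0.2, 0.9],
    "guitar": [0.1, 0.9, 0.1],
}
image_emb = [0.8, 0.2, 0.1]    # embedding of the initial image
sketch_emb = [0.1, 0.1, 0.95]  # embedding of the user sketch

image_label, image_probs = zero_shot_classify(image_emb, class_embs)
sketch_label, sketch_probs = zero_shot_classify(sketch_emb, class_embs)
print(image_label, sketch_label)
```

With these toy vectors the image is assigned to the class whose embedding it is closest to, and likewise for the sketch; the two resulting labels are the kind of class pair the later fine-tuning stage would receive.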