DiffMorph: Text-less Image Morphing with Diffusion Models
Shounak Chatterjee
TL;DR
DiffMorph tackles the problem of artist-driven image morphing without textual prompts by conditioning an initial image with user sketches. It introduces a sketch-to-image pathway (ConditionFlow) that abandons certain spatial skips in ControlNet to emphasize sketch constraints, and it fine-tunes a pre-trained diffusion denoiser with a controlled reconstruction objective to merge multiple concepts. The key contributions are (1) a precise sketch-to-image generator, (2) a single-image-per-concept multi-concept fine-tuning strategy with area-coverage regularization, and (3) an evaluation showing efficient inference times and competitive quality against prompt-based customization methods. The work enables artists to generate high-fidelity morphs by combining concepts through sketches alone, reducing dependence on prompt engineering and expanding creative control.
Abstract
Text-conditioned image generation is a prevalent application of AI image synthesis, yet giving artists intuitive control over the output remains challenging. Current customization methods require multiple images and a textual prompt for each object in order to specify it as a concept before generating a single customized image. In contrast, our work, DiffMorph, synthesizes images that mix concepts without using textual prompts. It integrates a sketch-to-image module so that user sketches serve as input: DiffMorph takes an initial image together with conditioning artist-drawn sketches and generates a morphed image. We fine-tune a pre-trained text-to-image diffusion model to reconstruct each image faithfully, and we seamlessly merge the images and the concepts from the sketches into a cohesive composition. We demonstrate the image generation capability of our work through our results and a comparison with prompt-based image generation.
