Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

Chaehun Shin, Jooyoung Choi, Heeseung Kim, Sungroh Yoon

TL;DR

Subject-driven text-to-image generation without domain-specific fine-tuning is challenging because the output must stay faithful to the reference subject while also following the text prompt. Diptych Prompting reframes zero-shot generation as diptych inpainting on a large-scale model (FLUX): the left panel holds the background-removed reference image, the right panel is inpainted under the guidance of a prompt describing the whole diptych, and cross-panel attention is boosted to preserve fine subject details. Empirical results show strong subject and text alignment, outperforming encoder-based image prompting baselines, and the same setup enables stylized generation and subject-driven editing. The method offers a scalable, training-free route to high-fidelity subject rendering across contexts, with potential extensions to higher resolution, multi-subject scenarios, and other modalities such as video or 3D.

Abstract

Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel, and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image and improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/
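
The input construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the white fill for the removed background, the zero placeholder for the right panel, and the prompt template (modeled on the example diptych text quoted in the Figure 2 caption) are all assumptions made for illustration. The subject mask is taken as given, since the segmentation model is outside the scope of the sketch.

```python
import numpy as np


def build_incomplete_diptych(reference, subject_mask, fill_value=1.0):
    """Assemble an incomplete diptych and its inpainting mask (illustrative).

    reference:    (H, W, 3) float array in [0, 1], the reference image.
    subject_mask: (H, W) bool array, True where the subject is.
    fill_value:   value used for removed background pixels (assumed white).
    """
    h, w, _ = reference.shape
    # Background removal: keep subject pixels, overwrite everything else.
    left = np.where(subject_mask[..., None], reference, fill_value)
    # Right panel is an empty placeholder that the inpainting model fills in.
    right = np.zeros_like(left)
    diptych = np.concatenate([left, right], axis=1)  # shape (H, 2W, 3)
    # Binary mask: True marks the region to inpaint (the whole right panel).
    mask = np.zeros((h, 2 * w), dtype=bool)
    mask[:, w:] = True
    return diptych, mask


def build_diptych_prompt(subject, target):
    """Hypothetical prompt template patterned after the Figure 2 example."""
    return (
        f"A diptych with two side-by-side images of same {subject}. "
        f"On the left, a photo of {subject}. "
        f"On the right, replicate this {subject} exactly but as {target}"
    )
```

The diptych, the mask, and the prompt would then be passed to a text-conditioned inpainting model; that call is omitted here because it depends on the specific pipeline used.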

Paper Structure

This paper contains 28 sections, 7 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Given a single reference image, our Diptych Prompting performs zero-shot subject-driven text-to-image generation through diptych inpainting. Building on the (a) diptych generation capability of FLUX [flux1-dev], we extend it to diptych inpainting with a separate module, resulting in (b) versatility across various tasks including subject-driven text-to-image generation, stylized image generation, and subject-driven image editing.
  • Figure 2: Diptych Generation Comparisons. We generate the diptych images with various TTI models from the following diptych text: "A diptych with two side-by-side images of same cat. On the left, a photo of a cat in front of Eiffel Tower. On the right, replicate this cat exactly but as a photo of a cat in the jungle".
  • Figure 3: (a) Overall Diptych Prompting Framework. Given the incomplete diptych $I_{\text{diptych}}$, the text prompt $T_{\text{diptych}}$ describing the diptych, and the binary mask $M_{\text{diptych}}$ specifying the right panel as the inpainting target, FLUX with a ControlNet module performs text-conditioned inpainting on the right panel while referencing the subject in the left panel. (b) Reference Attention Enhancement. To capture the granular details of the subject in the left panel, we enhance the reference attention, i.e., the attention weights between the queries of the right panel and the keys of the left panel.
  • Figure 4: Background Removal Effects. Simple diptych inpainting exhibits content leakage from the reference image, including background, pose, and location. We mitigate this unwanted leakage through background removal by $G_{\text{seg}}$.
  • Figure 5: Qualitative Comparisons. Please zoom in for a more detailed view and better comparison.
  • ...and 9 more figures
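
The reference attention enhancement described in the Figure 3 caption can be sketched as follows. This is a simplified single-head illustration under stated assumptions, not the paper's exact formulation: the scaling is applied to the post-softmax weights without renormalization, the function names are hypothetical, and the default factor `lam=1.3` is an arbitrary illustrative value.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def enhance_reference_attention(q, k, is_right, lam=1.3):
    """Rescale attention from right-panel queries to left-panel keys.

    q, k:     (N, d) query and key matrices over all image tokens.
    is_right: (N,) bool array, True for tokens in the right panel.
    lam:      scaling factor (> 1 boosts reference attention); assumed value.

    Returns the plain attention weights and the enhanced weights.
    """
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    enhanced = weights.copy()
    # Select the sub-block: rows = right-panel queries, cols = left-panel keys.
    block = np.ix_(is_right, ~is_right)
    enhanced[block] *= lam
    return weights, enhanced
```

Only the right-to-left block changes; attention within each panel and from the left panel to the right panel is left as computed.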