Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

Chaehun Shin; Jooyoung Choi; Heeseung Kim; Sungroh Yoon

TL;DR

Subject-driven text-to-image generation without domain-specific fine-tuning is challenging because fidelity to the subject must be balanced against alignment with the text prompt. Diptych Prompting reframes zero-shot generation as diptych inpainting on a large-scale model (FLUX): the left panel holds the background-removed reference image, the right panel is inpainted under a descriptive diptych prompt, and cross-panel attention is boosted to preserve fine subject details. Empirically, the method achieves strong subject and text alignment, outperforms encoder-based image prompting baselines, and also supports stylized generation and subject-driven editing. It offers a scalable, training-free route to high-fidelity subject rendering across contexts, with potential extensions to higher resolution, multi-subject scenarios, and other modalities such as video or 3D.

Abstract

Subject-driven text-to-image generation aims to produce images of a new subject within a desired context by accurately capturing both the visual characteristics of the subject and the semantic content of a text prompt. Traditional methods rely on time- and resource-intensive fine-tuning for subject alignment, while recent zero-shot approaches leverage on-the-fly image prompting, often sacrificing subject alignment. In this paper, we introduce Diptych Prompting, a novel zero-shot approach that reinterprets subject-driven generation as an inpainting task with precise subject alignment by leveraging the emergent property of diptych generation in large-scale text-to-image models. Diptych Prompting arranges an incomplete diptych with the reference image in the left panel and performs text-conditioned inpainting on the right panel. We further prevent unwanted content leakage by removing the background in the reference image, and we improve fine-grained details in the generated subject by enhancing attention weights between the panels during inpainting. Experimental results confirm that our approach significantly outperforms zero-shot image prompting methods, resulting in images that are visually preferred by users. Additionally, our method supports not only subject-driven generation but also stylized image generation and subject-driven image editing, demonstrating versatility across diverse image generation applications. Project page: https://diptychprompting.github.io/
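
The two mechanisms described above (the incomplete diptych and the cross-panel attention enhancement) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper's implementation: the function names, the white fill color, the choice to scale post-softmax weights and renormalize each row, and the scale value 1.3 are all hypothetical. Images are plain nested lists so that the example needs no external libraries.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]   # rows of RGB pixels
Mask = List[List[int]]      # rows of 0/1 flags


def build_incomplete_diptych(
    reference: Image,
    subject_mask: Mask,
    fill: Pixel = (255, 255, 255),  # assumed placeholder color
) -> Tuple[Image, Mask]:
    """Return (canvas, inpaint_mask) for a two-panel diptych.

    Left panel: the reference with its background replaced by `fill`.
    Right panel: blank, and marked with 1 in the inpainting mask.
    """
    height, width = len(reference), len(reference[0])
    canvas: Image = []
    inpaint_mask: Mask = []
    for y in range(height):
        # Keep only subject pixels; background removal limits content leakage.
        left = [reference[y][x] if subject_mask[y][x] else fill
                for x in range(width)]
        right = [fill] * width
        canvas.append(left + right)
        inpaint_mask.append([0] * width + [1] * width)
    return canvas, inpaint_mask


def boost_cross_panel_attention(
    weights: List[List[float]],
    right_queries: Set[int],
    left_keys: Set[int],
    scale: float = 1.3,  # hypothetical value
) -> List[List[float]]:
    """Scale attention from right-panel queries to left-panel keys.

    Assumption: `weights` are post-softmax rows; boosted rows are
    renormalized so that each still sums to 1.
    """
    boosted = []
    for q, row in enumerate(weights):
        new_row = list(row)
        if q in right_queries:
            new_row = [w * scale if k in left_keys else w
                       for k, w in enumerate(new_row)]
            total = sum(new_row)
            new_row = [w / total for w in new_row]
        boosted.append(new_row)
    return boosted
```

In a real pipeline the canvas and mask would be passed, together with the diptych prompt, to a text-conditioned inpainting model, and the attention rescaling would be applied inside the model's attention layers; both steps are outside the scope of this sketch.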
