LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model
Chenjie Cao, Yunuo Cai, Qiaole Dong, Yikai Wang, Yanwei Fu
TL;DR
LeftRefill reframes reference-guided image synthesis as contextual inpainting: a reference image is stitched horizontally to the left of a target canvas, and the model fills in the right side. It uses task- and view-specific prompt tuning through cross-attention, together with re-arranged self-attention, inside an off-the-shelf diffusion model to perform reference-guided inpainting (Ref-inpainting) and novel view synthesis without test-time fine-tuning. The framework extends to multi-view settings by using block causal masking to support autoregressive generation, which yields consistent multi-view outputs with few trainable parameters. Experiments on MegaDepth, Objaverse, and Google Scanned Objects show improved spatial fidelity, robustness to multiple references, and faster convergence than baselines. The approach offers a lightweight, generalizable path to spatially precise, view-consistent image synthesis with existing diffusion models.
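The block causal masking mentioned above can be illustrated with a generic sketch. The exact mask layout used by LeftRefill is not given in this summary, so the code below only shows the standard construction: tokens attend freely within their own view and to all earlier views, but never to later ones. The function name `block_causal_mask` and the parameters `num_views` and `tokens_per_view` are hypothetical.

```python
import numpy as np

def block_causal_mask(num_views: int, tokens_per_view: int) -> np.ndarray:
    """Return a boolean attention mask of shape (N, N), N = num_views * tokens_per_view.

    mask[i, j] is True when query token i may attend to key token j, i.e. when
    the view containing j is the same as, or earlier than, the view containing i.
    This is a generic block causal mask, not necessarily the paper's exact layout.
    """
    # Lower-triangular matrix over views: view r can see views 0..r.
    view_level = np.tril(np.ones((num_views, num_views)))
    # Expand every view-level entry into a (tokens_per_view x tokens_per_view) block.
    token_block = np.ones((tokens_per_view, tokens_per_view))
    return np.kron(view_level, token_block).astype(bool)

mask = block_causal_mask(num_views=3, tokens_per_view=2)
```

With such a mask, views can be generated one after another: each new view conditions on all previously produced views, which is what makes autoregressive multi-view generation possible.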
Abstract
This paper introduces LeftRefill, an approach that efficiently harnesses large Text-to-Image (T2I) diffusion models for reference-guided image synthesis. As the name implies, LeftRefill horizontally stitches the reference and target views into a single input: the reference image occupies the left side, and the target canvas is placed on the right. LeftRefill then paints the right-side target canvas based on the left-side reference and task-specific instructions. This task formulation resembles contextual inpainting, much as a human painter would work from a reference. It allows the model to learn both structural and textural correspondence between reference and target efficiently, without additional image encoders or adapters. We inject task and view information through the cross-attention modules of the T2I model, and further enable multi-view references via re-arranged self-attention modules. Together, these allow LeftRefill to generate consistent results as a generalized model, without test-time fine-tuning or model modifications. LeftRefill can thus be seen as a simple yet unified framework for reference-guided synthesis. As examples, we apply LeftRefill to two different challenges, reference-guided inpainting and novel view synthesis, both based on the pre-trained Stable Diffusion model. Code and models are released at https://github.com/ewrfcas/LeftRefill.
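The stitched input described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the helper name `stitch_left_reference`, the `(H, W, C)` array layout, and the convention that mask value 1 marks pixels to be generated are all hypothetical choices made for this example.

```python
import numpy as np

def stitch_left_reference(reference, target, fill_mask=None):
    """Build a side-by-side canvas with the reference on the left.

    reference, target: float arrays of shape (H, W, C).
    fill_mask: optional (H, W) array, 1 where the target must be generated.
               If omitted, the whole right side is generated (as one would
               assume for novel view synthesis).
    Returns (canvas, mask) with shapes (H, 2W, C) and (H, 2W).
    """
    if reference.shape != target.shape:
        raise ValueError("reference and target must have the same shape")
    h, w, _ = reference.shape
    if fill_mask is None:
        fill_mask = np.ones((h, w), dtype=np.float32)
    # Blank out the pixels that are to be generated on the right side.
    masked_target = target * (1.0 - fill_mask)[..., None]
    canvas = np.concatenate([reference, masked_target], axis=1)
    # The left (reference) half is never regenerated, so its mask is all zeros.
    mask = np.concatenate([np.zeros((h, w), dtype=np.float32), fill_mask], axis=1)
    return canvas, mask

reference = np.ones((4, 4, 3), dtype=np.float32)
target = np.full((4, 4, 3), 0.5, dtype=np.float32)
hole = np.zeros((4, 4), dtype=np.float32)
hole[1:3, 1:3] = 1.0  # region to inpaint on the target
canvas, mask = stitch_left_reference(reference, target, hole)
```

In this sketch, the canvas and mask would play the role of the usual inputs to an inpainting diffusion model, with the reference half acting as visible context for the masked right half.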
