LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model

Chenjie Cao, Yunuo Cai, Qiaole Dong, Yikai Wang, Yanwei Fu

TL;DR

LeftRefill reframes reference-guided image synthesis as contextual inpainting by horizontally stitching a left reference with a right target. It uses task- and view-specific prompt tuning and reactivated cross- and self-attention within off-the-shelf diffusion models to achieve Ref-inpainting and novel view synthesis without test-time fine-tuning. The framework extends to multi-view settings with block causal masking to support autoregressive generation, enabling consistent multi-view outputs with minimal trainable parameters. Empirical results on MegaDepth, Objaverse, and Google Scanned Objects show improved spatial fidelity, robustness to multiple references, and faster convergence compared to baselines, highlighting practical advantages for reference-guided diffusion. The approach offers a lightweight, generalizable path for spatially precise, view-consistent image synthesis using existing diffusion models.

Abstract

This paper introduces LeftRefill, an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis. As the name implies, LeftRefill horizontally stitches the reference and target views together into a single input. The reference image occupies the left side, while the target canvas is positioned on the right. LeftRefill then paints the right-side target canvas based on the left-side reference and specific task instructions. This task formulation resembles contextual inpainting, akin to the actions of a human painter. The formulation efficiently learns both structural and textural correspondence between reference and target without other image encoders or adapters. We inject task and view information through the cross-attention modules in T2I models, and further enable multi-view references via re-arranged self-attention modules. Together, these allow LeftRefill to perform consistent generation as a generalized model without requiring test-time fine-tuning or model modifications. Thus, LeftRefill can be seen as a simple yet unified framework for reference-guided synthesis. As an exemplar, we apply LeftRefill to two different challenges, reference-guided inpainting and novel view synthesis, based on the pre-trained Stable Diffusion. Code and models are released at https://github.com/ewrfcas/LeftRefill.
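The stitching described in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the released code: the function names `stitch_left_right` and `take_right`, the array layout `(H, W, 3)`, and the mask convention (1 marks pixels to be filled) are all assumptions made for illustration. For novel view synthesis the mask would presumably cover the whole right canvas.

```python
import numpy as np

def stitch_left_right(reference, target, mask):
    """Build a stitched input: reference on the left, masked target on the right.

    reference, target: float arrays of shape (H, W, 3) (assumed layout).
    mask: float array of shape (H, W); 1 marks pixels to fill (assumed convention).
    Returns the stitched image (H, 2W, 3) and the stitched mask (H, 2W).
    """
    h, w, _ = reference.shape
    # Blank out the region of the target that the model should paint.
    masked_target = target * (1.0 - mask[..., None])
    stitched = np.concatenate([reference, masked_target], axis=1)
    # The left (reference) half is never filled, so its mask is all zeros.
    stitched_mask = np.concatenate([np.zeros((h, w)), mask], axis=1)
    return stitched, stitched_mask

def take_right(output):
    """Discard the left-side reference and keep the right-side generation."""
    w = output.shape[1] // 2
    return output[:, w:]
```

A call such as `stitch_left_right(ref, tgt, mask)` on 512×512 images would yield a 512×1024 input, and `take_right` recovers the 512×512 result from the model output.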

Paper Structure (26 sections, 2 equations, 29 figures, 14 tables)

Figures (29)

  • Figure 1: LeftRefill addresses generation on the right canvas conditioned on left references. We can re-formulate several existing tasks in the LeftRefill manner, including (a) reference-guided inpainting and (b) novel view synthesis. The reference and target can be further extended to multi-view scenes, forming (c) multi-view reference inpainting and (d) multi-view synthesis, respectively. Green frames indicate stitched inputs. Reference views are placed on the left side, while masked target views are placed on the right side. Violet frames show only the enlarged right-side generations produced by LeftRefill. Note that we omit some input details in (c) and (d) for simplicity.
  • Figure 2: (a) Overview of LeftRefill. Inputs of Ref-inpainting and NVS are shown in (b). Task and view prompt embeddings and pose features (optional for NVS) are injected into CLIP-H for cross-attention learning in the U-Net. For the output of LeftRefill, we discard the left-side reference and take the right-side generation.
  • Figure 3: Illustration of the multi-view training inputs ($v\times H\times 2W$, $v=4$) of LeftRefill, where $v$, $H$, and $2W$ denote the number of views, height, and width of the stitched images. All views of Ref-inpainting (a) share the same masked target, while multi-view NVS (b) is trained with autoregressive (AR) generation.
  • Figure 4: Detailed architecture of LeftRefill for multi-view synthesis. Both CNN and cross-attention modules are applied separately to each stitched view, while all views share the same self-attention for multi-view correlation learning.
  • Figure 5: (a) Feature rearranging and (b) block causal masking of LeftRefill, where $b$, $v$, $h$, $w$, and $c$ denote the batch size, number of views, height, width, and channels of the features; $w$ is the width of the stitched features (downsampled from $2W$).
  • ...and 24 more figures
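
The feature rearranging and block causal masking named in the Figure 4 and Figure 5 captions can be sketched as follows. This is a hedged illustration, not the paper's implementation: the captions give only the symbols $b,v,h,w,c$, so the input layout `(b*v, h, w, c)`, the flattened token layout `(b, v*h*w, c)`, and the rule that a view attends to itself and to earlier views are assumptions, as are the function names.

```python
import numpy as np

def rearrange_views(features, b, v):
    """Merge per-view features into one token sequence per sample.

    features: array of shape (b*v, h, w, c), with views of the same sample
    stored contiguously (assumed layout).
    Returns an array of shape (b, v*h*w, c) so that self-attention can
    relate tokens across all views of a sample.
    """
    bv, h, w, c = features.shape
    assert bv == b * v, "first axis must equal batch size times view number"
    return features.reshape(b, v * h * w, c)

def block_causal_mask(v, tokens_per_view):
    """Boolean attention mask of shape (v*n, v*n), n = tokens_per_view.

    Entry [i, j] is True when query token i may attend to key token j,
    that is, when j belongs to the same view as i or to an earlier view
    (assumed rule for autoregressive multi-view generation).
    """
    view_id = np.repeat(np.arange(v), tokens_per_view)
    return view_id[:, None] >= view_id[None, :]
```

With `v = 4` views and `h * w` tokens per view, `block_causal_mask(4, h * w)` gives a lower block-triangular mask, so each view can be generated conditioned only on the views before it.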