Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach

Shaofeng Zhang, Jinfa Huang, Qiang Zhou, Zhibin Wang, Fan Wang, Jiebo Luo, Junchi Yan

TL;DR

PQDiff tackles image outpainting for arbitrary, continuous expansion multiples in one step, without relying on a pre-trained backbone. It introduces a diffusion-based framework conditioned on randomly cropped anchor/target views via relative positional embeddings and a position-aware transformer to learn both content and spatial relationships. It achieves state-of-the-art FID on Scenery, Building Facades, and WikiArts, and under the 2.25x, 5x, and 11.7x settings it takes only 40.6%, 20.3%, and 10.2% of the sampling time of the benchmark SOTA method. The method advances flexible, efficient outpainting for creative content generation and suggests potential extensions to other conditional diffusion tasks.

Abstract

Image outpainting aims to generate the content of an input sub-image beyond its original boundaries. It is an important task in content generation yet remains an open problem for generative models. This paper pushes the technical frontier of image outpainting in two directions that have not been resolved in the literature: 1) outpainting with arbitrary and continuous multiples (without restriction), and 2) outpainting in a single step (even for large expansion multiples). Moreover, we develop a method that does not depend on a pre-trained backbone network, which, in contrast, is commonly required by previous SOTA outpainting methods. Arbitrary-multiple outpainting is achieved by using randomly cropped views of the same image during training to capture arbitrary relative positional information. Specifically, by feeding one view and positional embeddings as queries, we can reconstruct another view. At inference, we generate images with arbitrary expansion multiples by inputting an anchor image and its corresponding positional embeddings. The one-step outpainting ability is particularly noteworthy in contrast to previous methods, which must be applied $N$ times to reach a final multiple that is $N$ times their basic, fixed multiple. We evaluate the proposed approach (called PQDiff, as we adopt a diffusion-based generator as our embodiment under our proposed Positional Query scheme) on public benchmarks, demonstrating its superior performance over state-of-the-art approaches. Specifically, PQDiff achieves state-of-the-art FID scores on the Scenery (21.512), Building Facades (25.310), and WikiArts (36.212) datasets. Furthermore, under the 2.25x, 5x and 11.7x outpainting settings, PQDiff takes only 40.6%, 20.3% and 10.2% of the time of the benchmark state-of-the-art (SOTA) method.
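
The training-time idea described in the abstract (two random crops of one image, related by their relative position) can be illustrated with a minimal sketch. This is not the paper's implementation: the helper names `random_crop_box`, `crop`, and `relative_box`, the crop-ratio range, and the simple four-number box encoding are all assumptions made for illustration, and the actual relative positional embedding is defined in the paper's own equations.

```python
import numpy as np

def random_crop_box(rng, height, width, min_ratio=0.15, max_ratio=0.5):
    """Sample a crop box (top, left, h, w); the ratio range is an assumed default."""
    ratio = rng.uniform(min_ratio, max_ratio)
    h = max(1, int(round(height * ratio)))
    w = max(1, int(round(width * ratio)))
    top = int(rng.integers(0, height - h + 1))   # upper bound is exclusive
    left = int(rng.integers(0, width - w + 1))
    return top, left, h, w

def crop(image, box):
    """Cut the box (top, left, h, w) out of an H x W x C array."""
    top, left, h, w = box
    return image[top:top + h, left:left + w]

def relative_box(anchor, target):
    """Describe the target box in units of the anchor box.

    Returns (dy, dx, scale_h, scale_w): offsets of the target's top-left
    corner from the anchor's, and the size ratio, all divided by the anchor size.
    This encoding is a hypothetical stand-in for the paper's embedding.
    """
    at, al, ah, aw = anchor
    tt, tl, th, tw = target
    return ((tt - at) / ah, (tl - al) / aw, th / ah, tw / aw)

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))          # placeholder for a training image

anchor = random_crop_box(rng, 256, 256)    # view given to the model as input
target = random_crop_box(rng, 256, 256)    # view the model must reconstruct
anchor_view = crop(image, anchor)
target_view = crop(image, target)
rel = relative_box(anchor, target)         # positional query for the target
print("anchor box:", anchor, "target box:", target)
print("relative position of target w.r.t. anchor:", rel)
```

Because both boxes are sampled independently, the relative position takes continuous values, which is what lets a single model cover arbitrary expansion multiples rather than one fixed ratio.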

Paper Structure (19 sections, 9 equations, 14 figures, 10 tables, 2 algorithms)

Figures (14)

  • Figure 1: PQDiff can outpaint images with arbitrary and continuous multiples in one step (b). In contrast, previous methods (a) outpaint images with discrete multiples in multiple steps. Note that $N$x here means that an $N$-times larger image needs to be generated; the 2.25x, 5x, and 11.7x settings used in the experiments follow the previous work QueryOTR for fair comparison.
  • Figure 2: Framework of PQDiff. RPE in Eq. \ref{eqn:pos_embed} denotes relative positional embeddings (pseudo-code for computing the RPE is given in Appendix \ref{sec:alg}). For training, we randomly crop the image twice with different random crop ratios to obtain two views. Then, we compute the relative positional embeddings of the anchor view (red box) and the target view (blue box). For sampling, i.e. testing or generation, we first compute the target view (blue box) from the anchor view (red box) to form a mode, i.e. the positional relation between the anchor view and the target view. With different types of modes, we can perform arbitrary and controllable image outpainting. Then, we feed the RPE, random Gaussian noise, and the input sub-image to perform outpainting. In theory, PQDiff can outpaint (predict) the region at any location, owing to the randomness of cropping in the training stage. We illustrate how to calculate the relative position in Appendix \ref{sec:pos}.
  • Figure 3: Comparison on the 2.25x, 5x, and 11.7x settings with the SOTA method QueryOTR. The images generated by QueryOTR come from the pre-trained model in their official repository. We highlight two kinds of artifacts in QueryOTR's outputs: the red box indicates that the boundary of the input sub-image is inconsistent with the generated region, and the yellow box marks noise and spots. We also find an interesting phenomenon (highlighted by the green ovals in the generated images in the right figure): PQDiff generates "clouds" and also reflects them in the "water", and the shape of the clouds in the sky is consistent with their reflections. In contrast, the previous method only generates "clouds" but ignores the reflection in the water.
  • Figure 4: Example images generated by PQDiff via random relative positional embeddings. The original input image (Ori) is mapped to its corresponding location in the generated image (Gen). Note that the generated images do not explicitly undergo any "copy" operation, but still preserve the input pixels. Moreover, PQDiff learns the scale of the input image according to the given mode setting.
  • Figure 5: Inception scores on the Scenery dataset with different random crop ratios of the anchor view $\mathbf{x}_{a}$ at 2.25x (left), 5x (middle), and 11.7x (right). In some cases, PQDiff outperforms PQDiff (Default), as we do not heavily tune the hyper-parameters of the default version. The query crop ratio is randomly sampled from $r\sim 0.5$, where $r$ is the value on the horizontal axis of the plots.
  • ...and 9 more figures
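
The sampling procedure in the Figure 2 caption first derives a target box from the anchor view for a chosen expansion multiple. The sketch below shows one way such a box could be computed. It rests on assumptions not stated in the captions above: that an $N$x multiple is an area ratio (so each side grows by $\sqrt{N}$), that the anchor sits at the center of the target, and that the box is expressed in anchor-relative units. The function names `centered_target_box` and `output_side` are hypothetical.

```python
import math

def centered_target_box(multiple):
    """Target box (dy, dx, scale_h, scale_w) in anchor units for an N-x outpainting.

    Assumes `multiple` is an area ratio and the anchor is centered in the target.
    """
    if multiple < 1.0:
        raise ValueError("an outpainting multiple must be at least 1")
    side = math.sqrt(multiple)          # per-side growth factor
    offset = -(side - 1.0) / 2.0        # target starts above/left of the anchor
    return (offset, offset, side, side)

def output_side(anchor_side_px, multiple):
    """Pixel side length of the generated image for a square anchor."""
    return int(round(anchor_side_px * math.sqrt(multiple)))

# The three settings named in the captions, plus one arbitrary continuous multiple.
for m in (2.25, 5.0, 11.7, 3.7):
    print(m, centered_target_box(m), output_side(128, m))
```

Under these assumptions a single box per multiple is all that changes between the 2.25x, 5x, and 11.7x settings, which is consistent with the one-step behavior described in Figure 1; off-center modes would only alter the two offsets.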