NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

Chenfei Wu, Jian Liang, Xiaowei Hu, Zhe Gan, Jianfeng Wang, Lijuan Wang, Zicheng Liu, Yuejian Fang, Nan Duan

TL;DR

NUWA-Infinity tackles the challenge of infinite visual synthesis by introducing an autoregressive over autoregressive framework that generates variable-size images and long videos. It couples a global patch-level autoregressive model with a local token-level autoregressive model, augmented by a Nearby Context Pool for memory efficiency and an Arbitrary Direction Controller for flexible patch ordering and dynamic positional embeddings. The approach supports five high-definition tasks and demonstrates strong performance on extremely large outputs, image outpainting, and long-duration videos, outperforming several fixed-size and patch-based baselines. The work offers practical advances for high-resolution content creation and suggests directions for more scalable, order-aware visual synthesis systems.

Abstract

In this paper, we present NUWA-Infinity, a generative model for infinite visual synthesis, which is defined as the task of generating arbitrarily-sized high-resolution images or long-duration videos. An autoregressive over autoregressive generation mechanism is proposed to deal with this variable-size generation task, where a global patch-level autoregressive model considers the dependencies between patches, and a local token-level autoregressive model considers dependencies between visual tokens within each patch. A Nearby Context Pool (NCP) is introduced to cache related patches that have already been generated as the context for the current patch being generated, which can significantly reduce computation costs without sacrificing patch-level dependency modeling. An Arbitrary Direction Controller (ADC) is used to decide suitable generation orders for different visual synthesis tasks and learn order-aware positional embeddings. Compared to DALL-E, Imagen and Parti, NUWA-Infinity can generate high-resolution images of arbitrary size and additionally supports long-duration video generation. Compared to NUWA, which also covers images and videos, NUWA-Infinity has superior visual synthesis capabilities in terms of resolution and variable-size generation. The GitHub link is https://github.com/microsoft/NUWA. The homepage link is https://nuwa-infinity.microsoft.com.
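
The nested generation loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the raster patch order, the row-based eviction rule, the 2D integer `extent`, and the deterministic `toy_next_token` stand-in for a learned model are all assumptions made for illustration. It only shows the shape of the idea: an outer loop over patches, an inner loop over tokens within a patch, and a small cache of nearby patches used as context.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]  # (row, col) of a patch in the patch grid


def nearby_context(pool: Dict[Coord, List[int]], pos: Coord, extent: int) -> List[Coord]:
    """Coordinates of cached patches within `extent` rows/cols of `pos` (sorted)."""
    r, c = pos
    return sorted(p for p in pool if abs(p[0] - r) <= extent and abs(p[1] - c) <= extent)


def toy_next_token(context: List[int], prefix: List[int], vocab: int) -> int:
    """Hypothetical stand-in for a learned token predictor (deterministic)."""
    return (sum(context) + sum(prefix) + len(prefix) + 1) % vocab


def generate(rows: int, cols: int, tokens_per_patch: int = 4,
             extent: int = 1, vocab: int = 16):
    """Generate a rows x cols grid of patches; returns (patches, peak pool size)."""
    pool: Dict[Coord, List[int]] = {}    # toy nearby-context cache
    output: Dict[Coord, List[int]] = {}  # every generated patch
    peak = 0
    for r in range(rows):                # global, patch-level autoregression
        for c in range(cols):
            # Assumed eviction rule: rows older than `extent` can never be
            # within range of the current or any later patch in raster order.
            for p in [p for p in pool if p[0] < r - extent]:
                del pool[p]
            context = [t for p in nearby_context(pool, (r, c), extent) for t in pool[p]]
            patch: List[int] = []
            for _ in range(tokens_per_patch):  # local, token-level autoregression
                patch.append(toy_next_token(context, patch, vocab))
            output[(r, c)] = patch
            pool[(r, c)] = patch
            peak = max(peak, len(pool))
    return output, peak


if __name__ == "__main__":
    patches, peak = generate(rows=3, cols=4)
    print(len(patches), "patches generated; peak pool size:", peak)
```

In this sketch the cache never holds more than `(extent + 1) * cols` patches, however many rows are generated, which is the kind of saving the abstract attributes to caching only nearby context.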

Paper Structure (36 sections, 9 equations, 11 figures, 7 tables, 2 algorithms)

Figures (11)

  • Figure 5: An overview of the proposed NUWA-Infinity model during the training process.
  • Figure 6: Illustration of patch order control in NUWA-Infinity. The left part shows four basic patch generation orders ($\omega, \omega^*,\zeta, \zeta^*$) during training. The right part shows how NUWA-Infinity performs the image outpainting task by composing these four orders. Arabic numerals indicate the order of global autoregression, arrows indicate the order of local autoregression.
  • Figure 8: Illustration of NCP in $\omega$-order with a context extent of (1,1,1).
  • Figure 9: Pixel-Guided VQGAN.
  • Figure 10: Inference pipeline of NUWA-Infinity for downstream tasks.
  • ...and 6 more figures