NAMI: Efficient Image Generation via Bridged Progressive Rectified Flow Transformers

Yuhang Ma, Bo Cheng, Shanyuan Liu, Hongyi Zhou, Liebucha Wu, Dawei Leng, Yuhui Yin

TL;DR

This paper introduces Bridged Progressive Rectified Flow Transformers (NAMI), which decompose the generation process across temporal, spatial, and architectural dimensions and enable multi-resolution training, accelerating model convergence.

Abstract

Flow-based Transformer models have achieved state-of-the-art image generation performance, but often suffer from high inference latency and computational cost due to their large parameter counts. To improve inference efficiency without compromising quality, we propose Bridged Progressive Rectified Flow Transformers (NAMI), which decompose the generation process across temporal, spatial, and architectural dimensions. We divide the rectified flow into stages according to resolution and use a BridgeFlow module to connect them. Fewer Transformer layers are used at low-resolution stages to generate image layouts and concept contours, and more layers are progressively added as the resolution increases. Experiments demonstrate that our approach converges quickly and reduces inference time while preserving generation quality. The main contributions of this paper are summarized as follows: (1) We introduce Bridged Progressive Rectified Flow Transformers that enable multi-resolution training, accelerating model convergence; (2) NAMI leverages piecewise flow and spatial cascading of the Diffusion Transformer (DiT) to rapidly generate images, reducing inference time by 64% when generating 1024-resolution images; (3) We propose a BridgeFlow module to align flows between different stages; (4) We propose the NAMI-1K benchmark to evaluate human preference performance, aiming to mitigate distributional bias and comprehensively assess model effectiveness. The results show that our model is competitive with state-of-the-art models.
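
The staged sampling idea described in the abstract can be illustrated with a minimal, hedged sketch. It is not the authors' implementation: `toy_velocity` is a hypothetical stand-in for the transformer (a straight-line velocity toward a constant image), `bridge` is an identity placeholder rather than the actual BridgeFlow module, and the conventions that t = 0 is noise, t = 1 is the image, each stage doubles the resolution, and time windows are equal are all assumptions made for illustration.

```python
import random


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a 2D grid stored as a list of rows."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def bridge(x):
    """Placeholder for the BridgeFlow module (identity; NOT the paper's design)."""
    return x


def toy_velocity(x, t, target=1.0):
    """Hypothetical model stand-in: straight-line velocity toward a constant image."""
    return [[(target - v) / (1.0 - t) for v in row] for row in x]


def sample(num_stages=3, base_size=4, steps_per_stage=5, seed=0):
    """Euler-integrate a flow split into equal time windows, one per resolution stage."""
    rng = random.Random(seed)
    # Start from Gaussian noise at the lowest resolution.
    x = [[rng.gauss(0.0, 1.0) for _ in range(base_size)] for _ in range(base_size)]
    for k in range(num_stages):
        t_start, t_end = k / num_stages, (k + 1) / num_stages
        dt = (t_end - t_start) / steps_per_stage
        for i in range(steps_per_stage):
            t = t_start + i * dt
            v = toy_velocity(x, t)
            x = [[xv + dt * vv for xv, vv in zip(xr, vr)] for xr, vr in zip(x, v)]
        if k < num_stages - 1:
            # Move to the next resolution before entering the next time window.
            x = bridge(upsample2x(x))
    return x


image = sample()
print(len(image), len(image[0]))
```

In this toy setting only the final stage runs at full resolution, which is the intuition behind the reported latency savings; the early, cheaper stages operate on grids with 16 and 4 times fewer entries.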

Paper Structure

This paper contains 16 sections, 8 equations, 28 figures, 7 tables, 1 algorithm.

Figures (28)

  • Figure 1: High-quality image synthesis results from NAMI-2B demonstrate its capabilities in precise prompt following, spatial reasoning, and aesthetic quality.
  • Figure 2: An overview of inference latency between the proposed NAMI-2B and the corresponding FLUX-2B base model of the same size without NAMI. With NAMI, inference performance improvement becomes more significant as image resolution increases. The measurements are conducted with a batch size of 1 on an A100 GPU.
  • Figure 3: Overview of the image generation process for FLUX-dev (Black Forest Labs, 2024) and our NAMI-2B, with upscaling alignment applied during the low-resolution stages of NAMI-2B.
  • Figure 4: Overview of NAMI: The left figure shows the progressive flow transformers of NAMI, where the same color represents the same module. The right figure depicts the integration of the BridgeFlow module, which establishes connections across adjacent time windows. Specifically, we divide the image generation process into $K$ resolution stages and the entire flow is divided into $K$ time windows, where adjacent stages are connected through upsampling and the BridgeFlow module. We use fewer transformer layers at the low-resolution stages to generate image layouts and concept contours, progressively adding more layers as the resolution increases.
  • Figure 5: The distribution of text lengths across GenEval, DPG-Benchmark and NAMI-1K.
  • ...and 23 more figures
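
The stage layout described in the Figure 4 caption ($K$ resolution stages, $K$ time windows, fewer layers at low resolution) can be sketched as a small schedule builder. This is a hedged illustration only: the `Stage` and `build_stages` names, the equal-width time windows, the halving of resolution per earlier stage, and the linear growth in layer count are assumptions, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    index: int        # stage number k, 0-based
    t_start: float    # start of the time window
    t_end: float      # end of the time window
    resolution: int   # image side length used in this stage
    num_layers: int   # transformer layers active in this stage


def build_stages(num_stages, final_resolution, total_layers):
    """Hypothetical schedule: equal windows, 2x resolution steps, linear layer growth."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    stages = []
    for k in range(num_stages):
        t_start = k / num_stages
        t_end = (k + 1) / num_stages
        # Each earlier stage works at half the side length of the next one.
        resolution = final_resolution // (2 ** (num_stages - 1 - k))
        # Later stages reuse the earlier layers and add more on top.
        num_layers = total_layers * (k + 1) // num_stages
        stages.append(Stage(k, t_start, t_end, resolution, num_layers))
    return stages


stages = build_stages(num_stages=3, final_resolution=1024, total_layers=24)
for s in stages:
    print(s)
```

With these illustrative settings the three stages run at 256, 512, and 1024 pixels with 8, 16, and 24 layers respectively, so the full-depth model is applied only in the last time window.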