TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows

Zhenglin Cheng, Peng Sun, Jianguo Li, Tao Lin

TL;DR

TwinFlow introduces a self-adversarial, twin-trajectory training paradigm to achieve high-quality 1-step generation on large-scale multimodal models without relying on frozen teachers or auxiliary discriminators. By extending time to a symmetric interval and enforcing velocity-field alignment between real and self-generated trajectories, it yields strong 1-NFE performance (e.g., GenEval around 0.83–0.86) and scales to Qwen-Image-20B with minimal quality trade-offs. The approach unifies multi-step and few-step objectives under an any-step framework and demonstrates practical gains in both text-to-image synthesis and image editing, including full-parameter training at 20B. Ablation studies show the importance of the TwinFlow losses and batch-balancing hyperparameters for stability and performance. Overall, TwinFlow offers a simple, memory-efficient route to near state-of-the-art generation with dramatic inference-time savings for large models.

Abstract

Recent advances in large multi-modal generative models have demonstrated impressive capabilities in multi-modal generation, including image and video generation. These models are typically built upon multi-step frameworks such as diffusion and flow matching, which inherently limits their inference efficiency: sampling typically requires 40-100 function evaluations (NFEs, the number of function evaluations). While various few-step methods aim to accelerate inference, existing solutions have clear limitations. Prominent distillation-based methods, such as progressive and consistency distillation, either require an iterative distillation procedure or degrade significantly at very few steps (< 4-NFE). Meanwhile, integrating adversarial training into distillation (e.g., DMD/DMD2 and SANA-Sprint) to enhance performance introduces training instability, added complexity, and high GPU memory overhead from the auxiliary trained models. To this end, we propose TwinFlow, a simple yet effective framework for training 1-step generative models that removes the need for fixed pretrained teacher models and avoids standard adversarial networks during training, making it well suited to building large-scale, efficient models. On text-to-image tasks, our method achieves a GenEval score of 0.83 at 1-NFE, outperforming strong baselines such as SANA-Sprint (a GAN loss-based framework) and RCGM (a consistency-based framework). Notably, we demonstrate the scalability of TwinFlow through full-parameter training on Qwen-Image-20B, transforming it into an efficient few-step generator. With just 1-NFE, our approach matches the performance of the original 100-NFE model on both the GenEval and DPG-Bench benchmarks, reducing computational cost by $100\times$ with minor quality degradation. The project page is available at https://zhenglin-cheng.com/twinflow.
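To make the NFE figures in the abstract concrete, the following is a minimal sketch of how a flow-based sampler spends function evaluations. It is not the TwinFlow training procedure or the authors' sampler. The time convention (noise at t=0, data at t=1), the fixed-step Euler integrator, the toy velocity field, and the names `euler_sample` and `make_toy_velocity` are all assumptions chosen for illustration. Each call to the velocity function stands in for one forward pass of the network, so a 100-step sampler costs 100 NFEs and a 1-step sampler costs 1.

```python
from typing import Callable, List, Tuple

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def euler_sample(velocity: VelocityFn, x0: Vector, num_steps: int) -> Tuple[Vector, int]:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler.

    Returns the final sample and the number of function evaluations (NFE).
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    x = list(x0)
    dt = 1.0 / num_steps
    nfe = 0
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so the toy field below stays finite
        v = velocity(x, t)  # one network forward pass in a real model
        nfe += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, nfe


def make_toy_velocity(target: Vector) -> VelocityFn:
    """Hypothetical stand-in for a trained network: a field whose
    trajectories are straight lines ending at `target` at t=1."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    v = make_toy_velocity([1.0, -2.0])
    for steps in (100, 4, 1):
        sample, nfe = euler_sample(v, [0.3, 0.7], steps)
        print(f"steps={steps} NFE={nfe} sample={sample}")
```

In this toy field the trajectories are perfectly straight, so a single Euler step lands on the same endpoint as 100 steps. Real pretrained models have curved trajectories, which is why naive 1-step sampling degrades and why dedicated few-step training methods exist.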

Paper Structure

This paper contains 46 sections, 13 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Results of Qwen-Image-20B-TwinFlow (NFE=2). See the appendix for the prompts used.
  • Figure 2: Overview of TwinFlow and comparison of training GPU memory. GPU memory usage is measured at 1024$\times$1024 resolution on Qwen-Image-20B (LoRA tuning) and SANA-1.6B.
  • Figure 3: Visualization of images generated by Qwen-Image and Qwen-Image-TwinFlow w.r.t. NFEs. Qwen-Image-TwinFlow generates high-quality images with just 1 NFE, surpassing the original Qwen-Image at 16 NFEs. Furthermore, its 2-NFE results show better visual detail than the 32-NFE outputs of Qwen-Image. See the appendix for the prompts used.
  • Figure 4: Ablation studies of TwinFlow. The ablations presented in (a) and (c) are conducted on Qwen-Image-TwinFlow. The results shown in (b) come from different models trained on the same dataset.
  • Figure 5: Comparison between Qwen-Image-TwinFlow and Qwen-Image-Lightning (1-NFE). The prompts and generated images are sourced from DPG-Bench. We observe that Qwen-Image-Lightning tends to generate very similar images even when the noise differs, which hurts diversity. Our model maintains both diversity and high-quality generation.
  • ...and 7 more figures