CoRe^2: Collect, Reflect and Refine to Generate Better and Faster

Shitong Shao, Zikai Zhou, Dian Xie, Yuetong Fang, Tian Ye, Lichen Bai, Zeke Xie

TL;DR

CoRe^2 introduces a three-stage, plug-and-play inference framework that bridges speed and fidelity for both diffusion models and visual autoregressive models by collecting CFG trajectories, learning a lightweight weak model to reflect easy-to-learn content, and applying weak-to-strong refinement to recover high-frequency details. The fast and slow inference modes, governed by W2S guidance, enable substantial latency reduction while preserving or improving image quality, with Z-CoRe^2 offering further gains via Z-Sampling. The approach generalizes across SDXL, SD3.5, FLUX, and LlamaGen, delivering consistent improvements on major benchmarks such as Pick-of-Pic, DrawBench, HPD v2, GenEval, and T2I-Compbench, and outperforming state-of-the-art inference methods with modest overhead. This work provides both practical gains for real-time T2I generation and theoretical backing for why weak-to-strong guidance improves high-frequency details in complex scenes.

Abstract

Making text-to-image (T2I) generative models sample both fast and well is a promising research direction. Previous studies have typically focused on either enhancing the visual quality of synthesized images at the expense of sampling efficiency or dramatically accelerating sampling without improving the base model's generative capacity. Moreover, almost no inference method has been able to ensure stable performance on both diffusion models (DMs) and visual autoregressive models (ARMs) simultaneously. In this paper, we introduce a novel plug-and-play inference paradigm, CoRe^2, which comprises three subprocesses: Collect, Reflect, and Refine. CoRe^2 first collects classifier-free guidance (CFG) trajectories, and then uses the collected data to train a weak model that reflects the easy-to-learn content while halving the number of function evaluations during inference. Subsequently, CoRe^2 employs weak-to-strong guidance to refine the conditional output, thereby improving the model's capacity to generate high-frequency and realistic content, which is difficult for the base model to capture. To the best of our knowledge, CoRe^2 is the first to demonstrate both efficiency and effectiveness across a wide range of DMs, including SDXL, SD3.5, and FLUX, as well as ARMs like LlamaGen. It has exhibited significant performance improvements on HPD v2, Pick-of-Pic, DrawBench, GenEval, and T2I-Compbench. Furthermore, CoRe^2 can be seamlessly integrated with the state-of-the-art Z-Sampling, outperforming it by 0.3 and 0.16 on PickScore and AES, respectively, while achieving a 5.64s time saving using SD3.5. Code is released at https://github.com/xie-lab-ml/CoRe/tree/main.
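
The guidance step mentioned in the abstract can be sketched as below. This is a minimal illustration and not the released implementation: the function names `cfg_combine` and `w2s_combine`, the toy values, and the extrapolation form `weak + scale * (strong - weak)` (chosen by analogy with classifier-free guidance) are all assumptions, and plain Python lists stand in for model outputs.

```python
from typing import List

Vector = List[float]


def cfg_combine(uncond: Vector, cond: Vector, scale: float) -> Vector:
    """Standard classifier-free guidance: extrapolate from the unconditional
    output toward the conditional output."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def w2s_combine(weak: Vector, strong: Vector, scale: float) -> Vector:
    """Assumed weak-to-strong form: extrapolate from the weak model's output
    toward the strong model's output (hypothetical, mirrors CFG)."""
    return [w + scale * (s - w) for w, s in zip(weak, strong)]


# Toy stand-ins for per-step model outputs (illustrative values only).
uncond = [0.0, 1.0, -1.0]
cond = [1.0, 1.0, 0.0]
weak = [0.5, 1.0, -0.5]
strong = [1.0, 1.0, 0.0]

cfg_out = cfg_combine(uncond, cond, scale=2.0)
w2s_out = w2s_combine(weak, strong, scale=1.5)
print("CFG:", cfg_out)
print("W2S:", w2s_out)
```

With a scale of exactly 1, either function simply returns its second argument; scales above 1 push the result further along the difference direction.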

Paper Structure

This paper contains 40 sections, 3 theorems, 21 equations, 21 figures, 8 tables.

Key Result

Theorem 3.2

(Proof in Appendix apd:do_w2s_work.) Assume there are two macro-level DMs, denoted as $\epsilon_\theta^\textrm{weak}$ and $\epsilon_\theta^\textrm{strong}$. According to Definition sec:def1 (the symbols in Eq. eq:condition_main_paper have the same meaning as in Definition sec:def1 when the right ... There exists $\omega_\textrm{w2s} > 1$ (i.e., the W2S guidance scale) such that the mean square err...
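
As a toy illustration of how a guidance scale above 1 can lower the mean squared error, the sketch below assumes that the weak and strong outputs err in the same direction, with the weak error twice as large. That assumption is made purely for illustration and is not the theorem's actual condition; the combination rule `weak + scale * (strong - weak)` and all names and values are likewise hypothetical.

```python
from typing import List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def w2s_combine(weak: Vector, strong: Vector, scale: float) -> Vector:
    """Assumed weak-to-strong extrapolation (hypothetical form)."""
    return [w + scale * (s - w) for w, s in zip(weak, strong)]


# Illustrative ground truth and a shared error direction (made-up values).
target = [1.0, -2.0, 0.5, 3.0]
error = [0.4, -0.2, 0.8, -0.6]

# Assumption: both models miss the target along `error`; the weak one misses by more.
strong = [t + 0.5 * e for t, e in zip(target, error)]
weak = [t + 1.0 * e for t, e in zip(target, error)]

results = {}
for scale in (1.0, 1.5, 2.0):
    results[scale] = mse(w2s_combine(weak, strong, scale), target)
    print(f"scale={scale}: mse={results[scale]:.6f}")
```

Under this aligned-error assumption the residual is proportional to `1 - 0.5 * scale`, so the error shrinks as the scale grows from 1 toward 2. Real model errors are not perfectly aligned, so the best scale in practice would have to be found empirically.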

Figures (21)

  • Figure 1: Left: Our proposed CoRe$^2$ achieves an excellent balance between performance and efficiency across SD3.5, SDXL, and LlamaGen. Specifically, for SD3.5 and SDXL, it produces more faithful, realistic, and detailed images with high semantic consistency, while significantly reducing computational overhead compared to standard sampling. Even on LlamaGen, CoRe$^2$ enhances the model's generative capabilities with only a minimal increase in computational cost. Right: Compared to previous inference-enhanced algorithms, CoRe$^2$ achieves optimal performance across the three dimensions of efficiency, generalization, and effectiveness.
  • Figure 2: A concise explanation of why CoRe$^2$ is effective (using a diffusion model as an example): we train a weak model to reflect the easy-to-learn components. Then, W2S guidance is employed to refine the fine-grained, more difficult-to-learn components.
  • Figure 3: Overview of Collect, Reflect and Refine (CoRe$^2$). We initially generate trajectories corresponding to CFG to collect data. Next, we train a weak model (i.e., the noise model equipped with MoE-LoRA) to capture the mapping from the conditional output to the CFG output, reflecting the easy-to-learn content. Finally, we employ W2S guidance to refine the conditional output (i.e., the fast mode) and the CFG output (i.e., slow mode), thereby enhancing the critical fine-grained information that is challenging to learn.
  • Figure 4: Framework of the noise model's backbone and MoE-LoRA. We employ MM-DiT-Block to construct the noise model for SD3.5 and SDXL, and we utilize MoE-LoRA to reflect easy-to-learn content across different timesteps. For LlamaGen, we replace the backbone with its native Llama block, without MoE-LoRA.
  • Figure 5: Visualization of the frequency histogram with the W2S guidance $\epsilon^t_\textrm{strong}\!-\!\epsilon^t_\textrm{weak}$ and CFG $\epsilon^t_\textrm{cond}\!-\!\epsilon^t_\textrm{uncond}$ in SDXL. Note that W2S guidance in CoRe$^2$ incorporates more high-frequency information compared to standard CFG. This effectively mitigates the base model's limitations in generating fine-grained yet challenging-to-learn content during the pre-training phase.
  • ...and 16 more figures
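
The kind of frequency comparison described in the Figure 5 caption can be approximated with a simple spectral measure. The sketch below is an assumption-laden stand-in, not the paper's histogram procedure: one-dimensional signals replace image-shaped guidance terms, a naive discrete Fourier transform is used for self-containment, and the helper `high_freq_fraction` and its `cutoff_ratio` parameter are hypothetical.

```python
import cmath
import math
from typing import List


def dft_magnitudes(signal: List[float]) -> List[float]:
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n) for i, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def high_freq_fraction(signal: List[float], cutoff_ratio: float = 0.5) -> float:
    """Share of non-DC spectral energy that lies in the upper part of the band."""
    energy = [m * m for m in dft_magnitudes(signal)[1:]]  # drop the DC bin
    total = sum(energy)
    if total == 0:
        return 0.0
    cutoff = int(len(energy) * cutoff_ratio)
    return sum(energy[cutoff:]) / total


n = 32
# A single slow sine cycle stands in for a smooth, low-frequency guidance term.
smooth = [math.sin(2 * math.pi * i / n) for i in range(n)]
# An alternating pattern stands in for a detail-heavy, high-frequency term.
detailed = [float((-1) ** i) for i in range(n)]

smooth_frac = high_freq_fraction(smooth)
detailed_frac = high_freq_fraction(detailed)
print("smooth:", round(smooth_frac, 4), "detailed:", round(detailed_frac, 4))
```

Applied to actual guidance terms, a higher fraction for the W2S difference than for the CFG difference would match the trend the caption reports; a two-dimensional FFT over latent or image tensors would be the natural extension.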

Theorems & Definitions (6)

  • Definition 3.1
  • Theorem 3.2
  • Lemma C.1
  • Definition C.2
  • Theorem C.3
  • proof