CoLoGen: Progressive Learning of Concept-Localization Duality for Unified Image Generation

YuXin Song, Yu Lu, Haoyuan Sun, Huanjin Yao, Fanglong Liu, Yifan Sun, Haocheng Feng, Hang Zhou, Jingdong Wang

TL;DR

CoLoGen is a unified diffusion framework for conditional image generation. It progressively learns and reconciles the concept-localization duality, the conflict between tasks that need conceptual understanding and tasks that need spatial localization, offering a principled representational perspective for unified image generation.

Abstract

Unified conditional image generation remains difficult because different tasks depend on fundamentally different internal representations. Some require conceptual understanding for semantic synthesis, while others rely on localization cues for spatial precision. Forcing these heterogeneous tasks to share a single representation leads to concept-localization representational conflict. To address this issue, we propose CoLoGen, a unified diffusion framework that progressively learns and reconciles this concept-localization duality. CoLoGen uses a staged curriculum that first builds core conceptual and localization abilities, then adapts them to diverse visual conditions, and finally refines their synergy for complex instruction-driven tasks. Central to this process is the Progressive Representation Weaving (PRW) module, which dynamically routes features to specialized experts and stably integrates their outputs across stages. Experiments on editing, controllable generation, and customized generation show that CoLoGen achieves competitive or superior performance, offering a principled representational perspective for unified image generation.

Paper Structure

This paper contains 21 sections, 8 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: Comparison between the multi-task learning strategy (a) and our progressive staged training (b) within the framework of unified multi-modal image generation. We specifically examine five conventional tasks: mask inpainting, image grounding, controllable image generation, customized image generation, and instruction-based image editing.
  • Figure 2: The overall framework of CoLoGen, the unified multi-modal-to-image generation model. At each training stage, CoLoGen integrates a set of condition-specific experts via the Progressive Representation Weaving (PRW) module, which is built on the KV projection layers and a dynamic router $G$. Notably, the QKV projection layer and the self-attention layer share weights across the Noisy Latent and Source Image Latent inputs. CoLoGen employs a progressive staged training strategy that gradually increases the number of experts $E_k$, allowing it to adapt better to more complex downstream tasks.
  • Figure 3: Qualitative comparison for customized image generation. We compare with several public state-of-the-art methods, including OmniGen [xiao2024omnigen], MS-Diffusion [wang2024ms], and IP-Adapter [Flux-IPadapter], on DreamBench [ruiz2023dreambooth]. CoLoGen achieves remarkable performance even when trained on a limited amount of proprietary data, which can be attributed to the rich multimodal knowledge acquired by the model during the endogenous pre-training and conditional injection learning phases.
  • Figure 4: Ablation studies on the hyperparameters of the lifelong strategy in the last stage, with final settings highlighted in orange. (a) Impact of $\alpha$, the weight that balances the veteran gate routing supervision. (b) Influence of the LoRA $rank$, where the LoRA alpha weight defaults to twice the $rank$. (c) Impact of $\rho$, which denotes the routing density of $expert_{N-1}$.
  • Figure 5: Qualitative comparisons with current state-of-the-art mask-inpainting methods. CoLoGen demonstrates robust text-following capabilities and exhibits strong visual coherence between the masked area and the background.
  • ...and 3 more figures
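
The captions of Figures 2 and 4 describe condition-specific experts attached to the KV projection layers, a dynamic router $G$ that weights them, and a LoRA alpha set to twice the rank. The following is a minimal illustrative sketch of that idea, not the paper's implementation. It assumes that each expert is a LoRA-style low-rank update, that the router applies a dense softmax over all experts, and that the expert's up-projection is zero-initialized (the usual LoRA convention). The names `LoRAExpert`, `routed_projection`, `W_base`, and `W_router` are hypothetical.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class LoRAExpert:
    """Hypothetical low-rank expert: delta(x) = (alpha / rank) * B (A x)."""

    def __init__(self, dim, rank, rng):
        self.rank = rank
        self.alpha = 2 * rank  # Figure 4 caption: alpha defaults to twice the rank
        # A is small random, B is zero, so a new expert starts as a no-op
        # (standard LoRA initialization, assumed here).
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(dim)]

    def delta(self, x):
        scale = self.alpha / self.rank  # always 2.0 under the default above
        return [scale * v for v in matvec(self.B, matvec(self.A, x))]


def routed_projection(W_base, W_router, experts, x):
    """Base K (or V) projection plus a router-weighted sum of expert deltas.

    Assumption: the router G is a linear layer followed by a dense softmax.
    """
    gates = softmax(matvec(W_router, x))  # one gate value per expert
    out = matvec(W_base, x)
    for g, expert in zip(gates, experts):
        out = [o + g * d for o, d in zip(out, expert.delta(x))]
    return out, gates


if __name__ == "__main__":
    rng = random.Random(0)
    dim, rank, num_experts = 4, 2, 3
    W_base = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(dim)]
    W_router = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]
    experts = [LoRAExpert(dim, rank, rng) for _ in range(num_experts)]
    out, gates = routed_projection(W_base, W_router, experts, [1.0, -0.5, 0.25, 2.0])
    print("gates:", gates)
    print("projected:", out)
```

Because alpha is tied to twice the rank, the LoRA scaling factor alpha / rank stays at 2 for every rank in this sketch, so changing the rank changes only the expert's capacity and not the magnitude of its scaling. Adding an expert in a later stage amounts to appending one `LoRAExpert` and one row to `W_router`; with the zero-initialized `B`, the projection output is unchanged until the new expert is trained.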