Group Diffusion Transformers are Unsupervised Multitask Learners

Lianghua Huang, Wei Wang, Zhi-Fan Wu, Huanzhang Dou, Yupeng Shi, Yutong Feng, Chen Liang, Yu Liu, Jingren Zhou

TL;DR

GDTs unify diverse visual generation tasks under a single group generation framework and achieve competitive zero-shot performance through unsupervised pretraining on large image groups, with only minimal architectural changes to diffusion transformers. By concatenating the tokens of all images in a group inside self-attention, GDTs learn cross-image correlations; they also support reference-based generation via SDEdit and inpainting, and a UI translates natural language instructions into group prompts. A dataset of ~500k image groups, plus a 10k-group fine-tuning subset, enables scalable pretraining. Ablations show that data scale, group size, and refinement techniques materially affect fidelity, consistency, and prompt adherence. The authors acknowledge a gap in image quality relative to state-of-the-art T2I models and point to future work on larger datasets and on extending the same group-generation paradigm to video tasks.

Abstract

While large language models (LLMs) have revolutionized natural language processing with their task-agnostic capabilities, visual generation tasks such as image translation, style transfer, and character customization still rely heavily on supervised, task-specific datasets. In this work, we introduce Group Diffusion Transformers (GDTs), a novel framework that unifies diverse visual generation tasks by redefining them as a group generation problem. In this approach, a set of related images is generated simultaneously, optionally conditioned on a subset of the group. GDTs build upon diffusion transformers with minimal architectural modifications by concatenating self-attention tokens across images. This allows the model to implicitly capture cross-image relationships (e.g., identities, styles, layouts, surroundings, and color schemes) through caption-based correlations. Our design enables scalable, unsupervised, and task-agnostic pretraining using extensive collections of image groups sourced from multimodal internet articles, image galleries, and video frames. We evaluate GDTs on a comprehensive benchmark featuring over 200 instructions across 30 distinct visual generation tasks, including picture book creation, font design, style transfer, sketching, colorization, drawing sequence generation, and character customization. Our models achieve competitive zero-shot performance without any additional fine-tuning or gradient updates. Furthermore, ablation studies confirm the effectiveness of key components such as data scaling, group size, and model design. These results demonstrate the potential of GDTs as scalable, general-purpose visual generation systems.
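The core modification described in the abstract, concatenating self-attention tokens across the images of a group, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head with identity query/key/value projections, and it omits text conditioning, positional embeddings, and the diffusion process itself. The function names (`self_attention`, `group_self_attention`, `per_image_self_attention`) are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        # Weighted average of the value vectors (here, the tokens themselves).
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def per_image_self_attention(group):
    """Baseline: each image attends only to its own tokens."""
    return [self_attention(image) for image in group]


def group_self_attention(group):
    """Group variant: tokens of all images are concatenated before attention.

    `group` is a list of images; each image is a list of token vectors.
    """
    lengths = [len(image) for image in group]
    flat = [token for image in group for token in image]  # concatenate across images
    attended = self_attention(flat)
    # Split the attended sequence back into per-image token lists.
    out, start = [], 0
    for n in lengths:
        out.append(attended[start:start + n])
        start += n
    return out


if __name__ == "__main__":
    image_a = [[1.0, 0.0], [0.0, 1.0]]
    image_b = [[2.0, 2.0], [0.5, -1.0]]
    grouped = group_self_attention([image_a, image_b])
    isolated = per_image_self_attention([image_a, image_b])
    print("grouped output for image_a: ", grouped[0])
    print("isolated output for image_a:", isolated[0])
```

In the per-image baseline, the output for one image is unaffected by the other images in the group. In the group variant, every token can attend to tokens of every other image, which is the mechanism the abstract credits for capturing cross-image relationships.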

Paper Structure

This paper contains 22 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1.1: Group Diffusion Transformers perform a vast array of visual generation tasks in a unified framework termed group generation. Note that NO task-specific dataset and NO additional gradient update is applied. The model generalizes to these tasks automatically after unsupervised training on image groups. For simplicity, the textual descriptions of the images are omitted here; they can be found in the Appendix.
  • Figure 1.2: When conditioned on a subset of the group, Group Diffusion Transformers can perform conditional group generation in an inpainting fashion. Note that the model generalizes to these tasks automatically after unsupervised training on image groups. The textual descriptions of the images are omitted here (they can be found in the Appendix) and are summarized as brief task descriptions.
  • Figure 2.1: Overview of the Group Diffusion Transformer, which requires only minimal adaptations to encoder-decoder and encoder-only visual generation architectures. We make a straightforward modification to the self-attention blocks by concatenating image tokens across group inputs, allowing the model to learn cross-image correlations.
  • Figure 2.2: Distribution of group size in our training dataset.
  • Figure 2.3: Example of our training dataset, where the group images are captioned through prompting our internal MLLMs.
  • ...and 3 more figures