Imbalance in Balance: Online Concept Balancing in Generation Models

Yukai Shi, Jiarong Ou, Rui Chen, Haotian Yang, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Kun Gai

TL;DR

The paper tackles unstable concept composition in text-to-image generation by analyzing causal factors and introducing an online, concept-wise balancing method. It introduces IMBA distance to quantify data distribution and IMBA loss to dynamically reweight concept regions during training, all without offline dataset pruning and with minimal code changes. A new benchmark, Inert-CompBench, targets inert concepts to stress-test compositional ability, alongside existing benchmarks. Empirical results show significant improvements in concept composition across multiple benchmarks, and ablation analyses validate the method’s efficiency, scalability, and compatibility with diffusion models. The work demonstrates that data distribution, rather than sheer scale or model size, is a primary determinant of composition quality at large scale, providing a practical, plug-and-play solution for robust concept synthesis in open-world generation tasks.

Abstract

In visual generation tasks, the responses to and combinations of complex concepts are often unstable and error-prone, a problem that remains under-explored. In this paper, we explore the causal factors behind poor concept responses through carefully designed experiments. We also design a concept-wise equalization loss function (IMBA loss) to address this issue. Our proposed method is online, eliminating the need for offline dataset processing, and requires minimal code changes. On our newly proposed complex concept benchmark Inert-CompBench and two other public test sets, our method significantly enhances the concept response capability of baseline models and yields highly competitive results with only a few lines of code. Code is released at https://github.com/KwaiVGI/IMBA-Loss.

Paper Structure

This paper contains 27 sections, 11 equations, 14 figures, 5 tables, 2 algorithms.

Figures (14)

  • Figure 1: Our method achieves better concept composition ability with a much smaller dataset (31M). Existing models suffer from missing objects, attribute leakage, and concept entanglement. Specifically, Figures (a) and (b) miss the expected concepts (twins, feather). Figures (c) and (d) match attributes to the wrong subjects. Figures (e) and (f) contain unnecessary concepts (fork, legs).
  • Figure 2: Concept distribution of the datasets, which follows a long-tailed distribution.
  • Figure 3: The performance of models with different parameter sizes on the LC-Mis benchmark [zhao2024lost].
  • Figure 4: The performance of models with different data scales and distributions.
  • Figure 5: Comparison between results from balanced and imbalanced datasets. We simulate the training and inference of diffusion models in a 2-dimensional space. With a dataset consisting of two classes (brown and purple points), diffusion models map random noise (blue points) to predictions (yellow points) via flow matching (green curve). Comparing Figures (a) and (c), imbalanced data shifts the predictions for the tail concept (purple points) from the black box to the red box, harming generalization on the tail concept. Comparing Figures (b) and (d), imbalanced data tilts the unconditional score distribution towards the head concept (brown points), from the black arrow to the red arrow, showing that the difference between unconditional and conditional score distributions can serve as a metric for dataset distribution.
  • ...and 9 more figures
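
The idea in the Figure 5 caption, that the gap between conditional and unconditional predictions reflects how rare a concept is in the data, can be illustrated with a small sketch. This is not the paper's IMBA distance or IMBA loss, whose exact formulas are not given in this summary. The function names (`score_gap`, `region_weights`, `reweighted_mse`), the L2 distance, the `gamma` exponent, and the mean-one normalization are all assumptions made for illustration only.

```python
import math

def score_gap(cond_pred, uncond_pred):
    """Per-region L2 distance between conditional and unconditional predictions.

    Assumption: each region is a flat list of floats. A larger gap is read as
    a rarer (tail) concept, following the intuition in the Figure 5 caption.
    """
    return [
        math.sqrt(sum((c - u) ** 2 for c, u in zip(c_vec, u_vec)))
        for c_vec, u_vec in zip(cond_pred, uncond_pred)
    ]

def region_weights(gaps, gamma=1.0, eps=1e-8):
    """Map gaps to weights with mean 1 (hypothetical choice): larger gap, larger weight."""
    raw = [(g + eps) ** gamma for g in gaps]
    mean = sum(raw) / len(raw)
    return [r / mean for r in raw]

def reweighted_mse(pred, target, weights):
    """Mean squared error per region, averaged with the given region weights."""
    per_region = [
        sum((p - t) ** 2 for p, t in zip(p_vec, t_vec)) / len(p_vec)
        for p_vec, t_vec in zip(pred, target)
    ]
    return sum(w * l for w, l in zip(weights, per_region)) / len(per_region)

# Toy example: region 0 has no gap (head-like), region 1 has a large gap (tail-like).
cond = [[1.0, 0.0], [0.0, 3.0]]
uncond = [[1.0, 0.0], [0.0, 0.0]]
target = [[0.0, 0.0], [0.0, 0.0]]

gaps = score_gap(cond, uncond)
weights = region_weights(gaps)
loss = reweighted_mse(cond, target, weights)
```

In this toy case the second region receives nearly all of the weight, so its error dominates the loss. In an actual training loop the weights would presumably be treated as constants (no gradient flows through them), but that too is an assumption here rather than something stated in the source.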