Low-Biased General Annotated Dataset Generation

Dengyang Jiang, Haoyu Wang, Lei Zhang, Wei Wei, Guang Dai, Mengmeng Wang, Jingdong Wang, Yanning Zhang

TL;DR

The paper tackles the problem that manually collected general datasets contain non-transferable biases that hurt cross-domain generalization. It proposes lbGen, which fine-tunes a pre-trained diffusion model, taking only the target dataset's category names as input, to produce low-biased, category-annotated images. Training uses a CLIP-based bi-level semantic alignment loss ($\mathcal{L}_{en}$ at the dataset level and $\mathcal{L}_{in}$ at the per-image level) together with a quality assurance loss $\mathcal{L}_{q}$. The final objective combines these terms as $\mathcal{L} = \mathcal{L}_{bi} + \lambda_1 \mathcal{L}_{q}$ with $\mathcal{L}_{bi} = \mathcal{L}_{en} + \mathcal{L}_{in}$. Empirical results show that backbone networks pre-trained on lbGen data generalize more stably across downstream tasks, especially when task-specific labeled data are scarce, and that lbGen improves robustness to common dataset biases. The approach enables scalable, bias-aware pre-training by generating high-quality, low-bias synthetic data guided solely by class-name prompts.
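How the terms of the objective combine can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `total_loss` and the scalar inputs are assumptions made for the example, and the summary gives no value for $\lambda_1$, so it is left as a required argument.

```python
def total_loss(l_en: float, l_in: float, l_q: float, lambda_1: float) -> float:
    """Combine the loss terms as L = L_bi + lambda_1 * L_q, with L_bi = L_en + L_in.

    All names here are hypothetical; the inputs stand in for already-computed
    scalar loss values.
    """
    l_bi = l_en + l_in           # bi-level semantic alignment loss
    return l_bi + lambda_1 * l_q  # add the weighted quality assurance term


if __name__ == "__main__":
    # Toy values chosen only to show the arithmetic.
    print(total_loss(l_en=0.5, l_in=0.25, l_q=2.0, lambda_1=0.5))
```

In this form, $\lambda_1$ only scales the quality term; the two alignment terms enter with equal weight, as in the formula above.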

Abstract

Pre-training backbone networks on a general annotated dataset (e.g., ImageNet) that comprises numerous manually collected images with category annotations has proven to be indispensable for enhancing the generalization capacity of downstream visual tasks. However, those manually collected images often exhibit bias that is non-transferable across categories or domains, which degrades the model's generalization capacity. To mitigate this problem, we present a low-biased general annotated dataset generation framework (lbGen). Instead of relying on expensive manual collection, we aim to directly generate low-biased images with category annotations. To achieve this goal, we propose to leverage the ability of a multimodal foundation model (e.g., CLIP) to align images in a low-biased semantic space defined by language. Specifically, we develop a bi-level semantic alignment loss, which not only forces all generated images to be consistent with the semantic distribution of all categories belonging to the target dataset in an adversarial learning manner, but also requires each generated image to match the semantic description of its category name. In addition, we cast an existing image quality scoring model into a quality assurance loss to preserve the quality of the generated images. By leveraging these two loss functions, we can obtain a low-biased image generation model by simply fine-tuning a pre-trained diffusion model using only the category names in the target dataset as input. Experimental results confirm that, compared with the manually labeled dataset or other synthetic datasets, our generated low-biased dataset leads to a stable enhancement of the generalization capacity of different backbone networks across various tasks, especially in tasks where manually labeled samples are scarce.
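One common way to make a generated image "match the semantic description of its category name" in a CLIP-style embedding space is to penalize low cosine similarity between the image embedding and the text embedding of the class name. The sketch below assumes that form; the abstract does not state the exact loss, so this should be read as an illustration of the idea rather than the paper's definition. The function names and the toy vectors (which stand in for real CLIP embeddings) are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def per_image_alignment_loss(image_emb: Sequence[float],
                             text_emb: Sequence[float]) -> float:
    """Assumed per-image loss: 1 - cos(image embedding, class-name embedding)."""
    return 1.0 - cosine_similarity(image_emb, text_emb)


if __name__ == "__main__":
    # Toy 2-D vectors in place of real embeddings.
    print(per_image_alignment_loss([3.0, 4.0], [3.0, 4.0]))  # same direction
    print(per_image_alignment_loss([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Under this assumed form, the loss is smallest when the image embedding points in the same direction as the class-name embedding and grows as the two diverge.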

Paper Structure

This paper contains 22 sections, 11 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Visualization of randomly sampled images from four datasets. It is hard to tell from these images which dataset exhibits low bias. However, models trained on these four datasets demonstrate a significant disparity in their generalization capabilities.
  • Figure 2: Overview of our training method. The generator first generates an image according to the class name. The image is then sent to the bi-level semantic guidance module and the quality assurance module, respectively, for loss calculation.
  • Figure 3: Scaling down the number of training images of eight transfer learning datasets. The benefits of using pre-trained models on our lbGen images are even more pronounced when there is less data for training.
  • Figure 4: Impact of the individual image alignment loss. We observe an ambiguity problem between classes when $\mathcal{L}_{in}$ is discarded.
  • Figure 5: Effectiveness of quality assurance loss. After adding $\mathcal{L}_{q}$, the image blur problem is solved.
  • ...and 1 more figure