Low-Biased General Annotated Dataset Generation

Dengyang Jiang; Haoyu Wang; Lei Zhang; Wei Wei; Guang Dai; Mengmeng Wang; Jingdong Wang; Yanning Zhang

TL;DR

The paper addresses the problem that manually collected general datasets contain non-transferable biases that hurt cross-domain generalization. It proposes lbGen, which fine-tunes a pre-trained diffusion model to produce low-biased, category-annotated images using only the target dataset's category names as input. Training uses a CLIP-based bi-level semantic alignment loss, with $\mathcal{L}_{en}$ at the dataset level and $\mathcal{L}_{in}$ at the per-image level, together with a quality assurance loss $\mathcal{L}_{q}$. The final objective is $\mathcal{L} = \mathcal{L}_{bi} + \lambda_1 \mathcal{L}_{q}$, where $\mathcal{L}_{bi} = \mathcal{L}_{en} + \mathcal{L}_{in}$ and $\lambda_1$ weights the quality term. Experiments show that backbone networks pre-trained on lbGen data generalize more stably across downstream tasks, especially when task-specific labeled data are scarce, and that lbGen improves robustness to common dataset biases. The approach enables scalable, bias-aware pre-training by generating high-quality, low-biased synthetic data guided solely by class-name prompts.
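As a rough illustration of how the terms of the objective fit together, the following minimal Python sketch combines three already-computed loss values into $\mathcal{L}$. The function names are hypothetical, and the cosine-distance form used for the per-image term is an assumption made only for illustration; this summary does not specify the actual forms of $\mathcal{L}_{en}$, $\mathcal{L}_{in}$, or $\mathcal{L}_{q}$, so the dataset-level and quality terms are treated as given scalars.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def per_image_alignment_loss(image_embs, text_embs):
    """Hypothetical stand-in for L_in (assumed form, not from the paper).

    Averages 1 - cos(image embedding, embedding of its category name)
    over paired image and text embeddings.
    """
    distances = [
        1.0 - cosine_similarity(img, txt)
        for img, txt in zip(image_embs, text_embs)
    ]
    return float(np.mean(distances))


def total_objective(l_en, l_in, l_q, lambda_1):
    """Combine the terms as L = L_bi + lambda_1 * L_q, with L_bi = L_en + L_in."""
    l_bi = l_en + l_in
    return l_bi + lambda_1 * l_q


# Toy embeddings: the first pair is perfectly aligned, the second is orthogonal.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[1.0, 0.0], [1.0, 0.0]]
l_in = per_image_alignment_loss(image_embs, text_embs)

# l_en and l_q are placeholder scalars standing in for the other two terms.
loss = total_objective(l_en=0.5, l_in=l_in, l_q=2.0, lambda_1=0.5)
```

In an actual training setup the embeddings would come from a CLIP-style encoder and each term would be a differentiable tensor rather than a Python float; the sketch only shows how $\lambda_1$ trades the quality term off against the two alignment terms.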

Abstract

Pre-training backbone networks on a general annotated dataset (e.g., ImageNet) that comprises numerous manually collected images with category annotations has proven indispensable for enhancing generalization capacity on downstream visual tasks. However, those manually collected images often exhibit bias that does not transfer across categories or domains, which degrades the model's generalization capacity. To mitigate this problem, we present a low-biased general annotated dataset generation framework (lbGen). Instead of relying on expensive manual collection, we aim to directly generate low-biased images with category annotations. To achieve this goal, we propose to leverage the ability of a multimodal foundation model (e.g., CLIP) to align images in a low-biased semantic space defined by language. Specifically, we develop a bi-level semantic alignment loss, which not only forces all generated images to be consistent with the semantic distribution of all categories in the target dataset in an adversarial learning manner, but also requires each generated image to match the semantic description of its category name. In addition, we cast an existing image quality scoring model into a quality assurance loss to preserve the quality of the generated images. With these two loss functions, we can obtain a low-biased image generation model by simply fine-tuning a pre-trained diffusion model using only the category names of the target dataset as input. Experimental results confirm that, compared with manually labeled datasets or other synthetic datasets, pre-training on our generated low-biased dataset yields stable improvements in the generalization capacity of different backbone networks across various tasks, especially tasks where manually labeled samples are scarce.
