Semantic-Guided Generative Image Augmentation Method with Diffusion Models for Image Classification

Bohan Li, Xiao Xu, Xinghao Wang, Yutai Hou, Yunlong Feng, Feng Wang, Xuanliang Zhang, Qingfu Zhu, Wanxiang Che

TL;DR

SGID is a data augmentation method for image classification that balances image diversity with semantic consistency. It uses a two-step pipeline: image labels and BLIP-generated captions form the prompt for image-to-image generation with Stable Diffusion, and a similarity-based guidance scale helps preserve the semantics of the original image. Across seven datasets and three backbones, SGID outperforms the best augmentation baseline by 1.72% on ResNet-50 (from scratch), 0.33% on ViT (ImageNet-21k), and 0.14% on CLIP-ViT (LAION-2B), and it can further improve performance when combined with other augmentation baselines. Human and automated evaluations, together with qualitative case studies, support its image diversity and semantic consistency.

Abstract

Existing image augmentation methods consist of two categories: perturbation-based methods and generative methods. Perturbation-based methods apply pre-defined perturbations to augment an original image, but only locally vary the image, thus lacking image diversity. In contrast, generative methods bring more image diversity in the augmented images but may not preserve semantic consistency, thus incorrectly changing the essential semantics of the original image. To balance image diversity and semantic consistency in augmented images, we propose SGID, a Semantic-guided Generative Image augmentation method with Diffusion models for image classification. Specifically, SGID employs diffusion models to generate augmented images with good image diversity. More importantly, SGID takes image labels and captions as guidance to maintain semantic consistency between the augmented and original images. Experimental results show that SGID outperforms the best augmentation baseline by 1.72% on ResNet-50 (from scratch), 0.33% on ViT (ImageNet-21k), and 0.14% on CLIP-ViT (LAION-2B). Moreover, SGID can be combined with other image augmentation baselines and further improves the overall performance. We demonstrate the semantic consistency and image diversity of SGID through quantitative human and automated evaluations, as well as qualitative case studies.

Paper Structure (29 sections, 3 equations, 13 figures, 8 tables)

This paper contains 29 sections, 3 equations, 13 figures, and 8 tables.

Figures (13)

  • Figure 1: A comparison of four baseline methods and our proposed SGID using the ViT (ImageNet-21k) backbone across four datasets. Our SGID strikes a balance between semantic consistency and image diversity, leading to the highest performance improvement.
  • Figure 2: An illustration of our SGID. Step 1 first collects textual labels for each image, then uses BLIP to generate captions, and finally uses CLIP to calculate the similarity between the chosen caption and the original image. Step 2 generates the augmented images with the Stable Diffusion model, using the original image, a prompt consisting of the textual label and caption, the noise rate, and a guidance scale based on the similarity.
  • Figure 3: Average cosine similarity between the augmented and original images for each category of CIFAR-10 (air.: airplane, auto.: automobile).
  • Figure 4: Case study of seven DA methods.
  • Figure 5: Case study on the influence of noise rate and guidance scale when generating images by SGID on Cars.
  • ...and 8 more figures
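
The two steps described in the Figure 2 caption can be sketched as a small planning routine that assembles the inputs for one augmented image. This is a minimal illustration under stated assumptions, not the authors' implementation. The prompt template, the linear mapping from similarity to guidance scale (including its direction and its 5.0 to 10.0 range), the default noise rate, and all function and class names are hypothetical. The embeddings stand in for CLIP image and caption features, and the actual BLIP, CLIP, and Stable Diffusion calls are left out.

```python
import math
from dataclasses import dataclass


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def build_prompt(label, caption):
    """Join the textual label and the caption into one prompt.

    The template is hypothetical; the source only says the prompt
    consists of the textual label and the caption.
    """
    return f"a photo of {label}, {caption}"


def guidance_scale_from_similarity(similarity, low=5.0, high=10.0):
    """Map a caption-image similarity to a guidance scale.

    Assumption: a linear map where a caption that matches the image
    better is trusted more (higher scale). The source states only that
    the guidance scale is based on the similarity.
    """
    s = min(max(similarity, 0.0), 1.0)  # clamp to [0, 1]
    return low + (high - low) * s


@dataclass
class GenerationRequest:
    """Inputs for one image-to-image generation call (original image omitted)."""
    prompt: str
    noise_rate: float
    guidance_scale: float


def plan_augmentation(label, caption, image_emb, caption_emb, noise_rate=0.5):
    """Step 1 output (label, caption, similarity) -> Step 2 inputs."""
    similarity = cosine_similarity(image_emb, caption_emb)
    return GenerationRequest(
        prompt=build_prompt(label, caption),
        noise_rate=noise_rate,
        guidance_scale=guidance_scale_from_similarity(similarity),
    )
```

In a full pipeline, the resulting request would be passed, together with the original image, to an image-to-image diffusion model. The same cosine_similarity helper could also be averaged per category over augmented and original image embeddings, in the spirit of the comparison reported in Figure 3.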