Data Factory with Minimal Human Effort Using VLMs

Jiaojiao Ye, Jiaxing Zhong, Qian Xie, Yuzhou Zhou, Niki Trigoni, Andrew Markham

TL;DR

This work tackles the data hunger of semantic segmentation by introducing Diffusion Synthesis, a training-free pipeline that couples Vision-Language Models with ControlNet to generate pixel-precise synthetic data. The approach uses three modules—Multi-way Prompt Generator, Mask Generator, and High-quality Image Selection—to produce diverse, high-fidelity image–mask pairs with minimal human effort, and balances real and synthetic data to train few-shot segmentation models. Empirical results on PASCAL-$5^i$ and COCO-$20^i$ show substantial improvements over prior diffusion-based methods (e.g., better mIoU and lower FID), demonstrating the practicality and effectiveness of automatic data synthesis for low-data regimes. The work highlights the potential for domain-targeted, controllable augmentation to reduce labeling costs while boosting downstream performance, with limitations including biases from large multimodal models and task/domain sensitivity.

Abstract

Generating sufficient and diverse data through augmentation offers an efficient alternative to the time-consuming and labour-intensive process of collecting images and annotating them at the pixel level. Traditional data augmentation techniques often struggle to manipulate high-level semantic attributes, such as materials and textures. In contrast, diffusion models offer a robust alternative through text-to-image or image-to-image translation. However, existing diffusion-based methods are either computationally expensive or compromise on performance. To address this issue, we introduce a novel training-free pipeline that integrates a pretrained ControlNet and Vision-Language Models (VLMs) to generate synthetic images paired with pixel-level labels. This approach eliminates the need for manual annotations and significantly improves performance on downstream tasks. To improve fidelity and diversity, we add a Multi-way Prompt Generator, a Mask Generator, and a High-quality Image Selection module. Our results on PASCAL-$5^i$ and COCO-$20^i$ show promising performance and outperform concurrent work on one-shot semantic segmentation.

Paper Structure

This paper contains 36 sections, 3 equations, 21 figures, 7 tables.

Figures (21)

  • Figure 1: Given an image dataset, our pipeline can generate $K$ synthetic images paired with pixel-level labels using VLMs and a pre-trained ControlNet checkpoint. The resulting synthetic images are mixed with real images for training downstream tasks.
  • Figure 2: Overview of the Diffusion Synthesis framework, consisting of three separate stages. Given an image dataset, our pipeline can generate $K$ synthetic images paired with pixel-level masks through image-to-image translation using VLMs and a pre-trained diffusion model. The resulting synthetic images are combined with real images for training downstream segmentation tasks.
  • Figure 4: Illustration of our selection module. This module has two sub-processes: cosine similarity filtration and foundation-model-driven matching. It takes the source image, the synthesized image, and the previous mask as input, and produces high-quality filtered images as output.
  • Figure 5: Visualization of one-shot semantic segmentation.
  • Figure 6: Synthetic dataset performance comparison between ours and synth-VOC.
  • ...and 16 more figures
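
The Figure 4 caption mentions a cosine similarity filtration step in the selection module. The listing here does not say which features are compared, what threshold is used, or how the step connects to the foundation-model-driven matching, so the following is only a minimal sketch under stated assumptions: images are assumed to be already embedded as plain feature vectors, and the names `cosine_similarity`, `filter_by_similarity`, and the `threshold` value of 0.8 are hypothetical, chosen for illustration rather than taken from the paper.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)


def filter_by_similarity(
    source_feat: Sequence[float],
    candidate_feats: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical value, not from the paper
) -> List[int]:
    """Return indices of candidates whose similarity to the source meets the threshold."""
    return [
        i
        for i, feat in enumerate(candidate_feats)
        if cosine_similarity(source_feat, feat) >= threshold
    ]


# Toy embeddings standing in for features of a source image and three synthetic images.
source = [1.0, 0.0, 1.0]
candidates = [
    [1.0, 0.1, 0.9],  # nearly the same direction as the source
    [0.0, 1.0, 0.0],  # orthogonal to the source
    [2.0, 0.0, 2.0],  # same direction, different magnitude
]
kept = filter_by_similarity(source, candidates, threshold=0.8)
print(kept)
```

Because cosine similarity ignores vector magnitude, the third candidate is kept even though its norm differs from the source; only the orthogonal candidate is dropped. Whether the actual pipeline keeps high-similarity images, or also rejects near-duplicates to preserve diversity, is not stated here.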