AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks

You Li, Fan Ma, Yi Yang

TL;DR

AnySynth introduces a unified Layout-Image-Annotation pipeline to synthesize versatile training data for generalized vision-language tasks. It deploys three modules to produce task-consistent content and annotations: Task-Specific Layout Generation (LLMs + layout priors), Uni-Controlled Image Generation (MIGC with style and reference guidance), and Task-Oriented Annotation (refinement and QA-style data). Across Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Composed Image Retrieval, and Multi-modal Perception, AnySynth demonstrates consistent performance gains, and ablations validate each module. The framework reduces data collection costs while remaining broadly applicable, with future work targeting specialized domains such as medical imaging and remote sensing.

Abstract

Diffusion models have recently been employed to generate high-quality images, reducing the need for manual data collection and improving model generalization in tasks such as object detection, instance segmentation, and image perception. However, synthetic data frameworks are usually designed with meticulous human effort for each task because of differing requirements on image layout, content, and annotation formats, which restricts the application of synthetic data to more general scenarios. In this paper, we propose AnySynth, a unified framework integrating adaptable, comprehensive, and highly controllable components capable of generating an arbitrary type of synthetic data given diverse requirements. Specifically, the Task-Specific Layout Generation Module is first introduced to produce reasonable layouts for different tasks by leveraging the generation ability of large language models and layout priors of real-world images. A Uni-Controlled Image Generation Module is then developed to create high-quality, controllable synthetic images based on the generated layouts. In addition, user-specified reference images and style images can be incorporated into the generation to meet task requirements. Finally, the Task-Oriented Annotation Module offers precise and detailed annotations for the generated images across different tasks. We have validated our framework's performance across various tasks, including Few-shot Object Detection, Cross-domain Object Detection, Zero-shot Composed Image Retrieval, and Multi-modal Image Perception and Grounding. The data synthesized by our framework significantly improves model performance in these tasks, demonstrating the generality and effectiveness of our framework.
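
The three-stage flow described in the abstract (layout, then image, then annotation) can be sketched as a small pipeline skeleton. This is only an illustrative sketch, not the authors' implementation: every name here (`Layout`, `SyntheticSample`, `generate_layout`, `generate_image`, `annotate`, `synthesize`) is hypothetical, the layout stage is a fixed rule standing in for an LLM plus layout priors, and the image stage returns a description dictionary instead of calling a diffusion model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Layout:
    scene: str                          # basic scene description
    instances: List[Tuple[str, Box]]    # (category, box) pairs


@dataclass
class SyntheticSample:
    layout: Layout
    image: Dict                         # placeholder for generated pixels
    annotations: List[Dict]


def generate_layout(task: str, categories: List[str]) -> Layout:
    """Stand-in for the layout stage: one evenly spaced box per category.

    A real system would query an LLM and real-image layout statistics here.
    """
    if not categories:
        raise ValueError("at least one category is required")
    width = 1.0 / len(categories)
    instances = [
        (cat, (i * width, 0.25, (i + 1) * width, 0.75))
        for i, cat in enumerate(categories)
    ]
    return Layout(scene=f"a photo for {task}", instances=instances)


def generate_image(layout: Layout,
                   reference: Optional[str] = None,
                   style: Optional[str] = None) -> Dict:
    """Stand-in for the image stage: records the conditioning inputs only."""
    return {
        "prompt": layout.scene,
        "num_instances": len(layout.instances),
        "reference": reference,
        "style": style,
    }


def annotate(task: str, layout: Layout, image: Dict) -> List[Dict]:
    """Stand-in for the annotation stage: copies layout boxes into records."""
    return [
        {"task": task, "category": cat, "bbox": list(box)}
        for cat, box in layout.instances
    ]


def synthesize(task: str, categories: List[str],
               reference: Optional[str] = None,
               style: Optional[str] = None) -> SyntheticSample:
    """Chain the three stages in the order the abstract describes."""
    layout = generate_layout(task, categories)
    image = generate_image(layout, reference, style)
    return SyntheticSample(layout, image, annotate(task, layout, image))


sample = synthesize("few-shot detection", ["dog", "bicycle"], style="sketch")
```

The point of the sketch is the interface: because every task goes through the same `synthesize` entry point, only the stage implementations and the annotation format need to change per task.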

Paper Structure

This paper contains 23 sections, 8 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: How AnySynth differs from other frameworks. (a) Several typical tasks with diverse requirements. (b) Common synthetic data collection frameworks, which need a specific design for each task. (c) Our AnySynth framework, which handles diverse tasks in one unified framework and enhances the generality of synthetic data.
  • Figure 1: Overview of our system prompt, input, and output.
  • Figure 2: Overview of AnySynth, which consists of three modules. (a) The Task-Specific Layout Generation Module parses various layout parameters using LLMs and combines them with dataset statistics to derive object layouts and basic scenes. (b) The Uni-Controlled Image Generation Module achieves comprehensive, high-quality image generation by integrating the layout, instance reference and style reference images, and position quality filtering. (c) The Task-Oriented Annotation Module provides fine-grained annotations for downstream tasks.
  • Figure 2: Image generated by our framework for ZSCIR.
  • Figure 3: Quantitative results in Few-Shot Image Classification.
  • ...and 7 more figures