InstaDA: Augmenting Instance Segmentation Data with Dual-Agent System

Xianbao Hou, Yonghao He, Zeyd Boukhers, John See, Hu Su, Wei Sui, Cong Yang

TL;DR

InstaDA introduces a training-free dual-agent framework for large-scale instance segmentation data augmentation. The T-Agent generates diverse synthetic data through a collaborative prompt-generation loop between LLMs and diffusion models, while the I-Agent enriches the data distribution by generating images conditioned on real training samples; both run as automated workflows and use LoRA for efficiency. Key innovations include a CLIP dual-similarity filtration, a SAM-box reannotation strategy, and a Proportional Filtration scheme, which together yield substantial improvements on LVIS over baselines and prior methods. The approach demonstrates that balancing data diversity with distribution alignment is crucial for mitigating overfitting and improving generalization in long-tailed segmentation settings, offering a practical, scalable augmentation pipeline.

Abstract

Acquiring high-quality instance segmentation data is challenging due to the labor-intensive nature of the annotation process and significant class imbalances within datasets. Recent studies have combined Copy-Paste with diffusion models to create more diverse datasets. However, these studies often lack deep collaboration between large language models (LLMs) and diffusion models, and underutilize the rich information within the existing training data. To address these limitations, we propose InstaDA, a novel, training-free Dual-Agent system designed to augment instance segmentation datasets. First, we introduce a Text-Agent (T-Agent) that enhances data diversity through collaboration between LLMs and diffusion models. This agent features a novel Prompt Rethink mechanism, which iteratively refines prompts based on the generated images. This process not only fosters collaboration but also increases image utilization and optimizes the prompts themselves. Additionally, we present an Image-Agent (I-Agent) aimed at enriching the overall data distribution. This agent augments the training set by generating new instances conditioned on the training images. To ensure practicality and efficiency, both agents operate as independent, automated workflows. Experiments conducted on the LVIS 1.0 validation set indicate that InstaDA achieves significant improvements, with an increase of +4.0 in box average precision (AP) and +3.3 in mask AP compared to the baseline. Furthermore, it outperforms the leading model, DiverGen, by +0.3 in box AP and +0.1 in mask AP, with a notable +0.7 gain in box AP on common categories and mask AP gains of +0.2 on common categories and +0.5 on frequent categories.

Paper Structure

This paper contains 39 sections, 4 equations, 5 figures, 12 tables, 1 algorithm.

Figures (5)

  • Figure 1: Examples of GDDE proposed by DiverGen (Fan et al., 2024) and our T-Agent. Both sets of images are generated by Flux with identical settings. (a) GDDE creates instances with limited visual diversity. (b) The T-Agent generates a significantly more diverse set of instances, showcasing superior variation in object appearance.
  • Figure 2: Overview of the InstaDA pipeline. Our dual-agent system consists of two parallel workflows. The T-Agent (top) first generates images from diverse prompts to enhance data diversity. These instances are then segmented using BiRefNet and filtered by our CLIP dual-similarity metric. Low-quality instances trigger the Prompt Rethink mechanism to refine the initial prompts. These polished instances create a separate synthetic pool for augmentation. In parallel, the I-Agent (bottom) generates instances conditioned on the training images to enrich the overall data distribution. For these instances, our SAM-box strategy is applied for precise annotation, followed by a proportional filtration strategy based on CLIP scores to ensure high quality. These instances form an additional augmentation pool. Finally, instances from both pools are jointly used for Copy-Paste.
  • Figure 3: Examples of our proposed I-Agent. Conditioning on a source image (top), our I-Agent generates augmented data (bottom) with subtle yet meaningful alterations to object details. Close inspection reveals these nuanced changes, which serve to enrich the overall data distribution.
  • Figure 4: Visualization of the data distributions of the generated data and the LVIS dataset. This UMAP visualization illustrates that an excessive volume of generated data creates a suboptimal data distribution relative to the LVIS validation set, leading to degraded model performance.
  • Figure 5: Examples of SAM-bg and BiRefNet. BiRefNet produces more precise segmentation masks than SAM-bg, demonstrating superior performance on generated images with both simple and complex backgrounds.
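
The Figure 2 caption mentions a proportional filtration strategy based on CLIP scores but does not spell out the rule. As a minimal sketch of one plausible reading, the snippet below keeps a fixed top fraction of instances per category, ranked by a precomputed score. The function name `proportional_filter`, the `keep_ratio` parameter, and the tuple layout are assumptions made for illustration and are not taken from the paper; the scores stand in for CLIP similarities computed elsewhere.

```python
import math
from collections import defaultdict


def proportional_filter(instances, keep_ratio=0.5):
    """Keep the top `keep_ratio` fraction of instances in each category.

    `instances` is a list of (category, score, instance_id) tuples, where
    `score` is assumed to be a precomputed CLIP similarity (hypothetical
    input format). Returns a dict mapping category -> kept instance ids,
    ordered from highest to lowest score.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")

    # Group scored instances by category so filtering is per-category.
    by_category = defaultdict(list)
    for category, score, instance_id in instances:
        by_category[category].append((score, instance_id))

    kept = {}
    for category, scored in by_category.items():
        # Rank by score, best first.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Keep a proportional share, but never drop a category entirely.
        n_keep = max(1, math.ceil(len(scored) * keep_ratio))
        kept[category] = [instance_id for _, instance_id in scored[:n_keep]]
    return kept


if __name__ == "__main__":
    demo = [
        ("cat", 0.9, "a"),
        ("cat", 0.2, "b"),
        ("cat", 0.5, "c"),
        ("cat", 0.7, "d"),
        ("dog", 0.1, "e"),
    ]
    print(proportional_filter(demo, keep_ratio=0.5))
```

Filtering per category rather than with one global threshold means rare categories are not wiped out by a cutoff tuned on frequent ones, which matters in a long-tailed setting such as LVIS. Whether the paper's scheme works this way is not stated in the text above.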