
Towards Continual Expansion of Data Coverage: Automatic Text-guided Edge-case Synthesis

Kyeongryeol Go

TL;DR

This work tackles dataset bias by automating edge-case data synthesis through caption-based prompting. It introduces a pipeline in which an LLM, preference-tuned with Direct Preference Optimization, rephrases image captions into edge-focused prompts for a text-to-image model, guided by edge-ness as measured by a pre-trained detector and a pseudo-labeler. The approach augments the training data iteratively to expand coverage of challenging scenarios. On FishEye8K, it improves robustness beyond naive or manually engineered prompts, and the gains transfer across model scales. The results suggest a scalable, data-centric path toward more reliable vision systems.

Abstract

The performance of deep neural networks is strongly influenced by the quality of their training data. However, mitigating dataset bias by manually curating challenging edge cases remains a major bottleneck. To address this, we propose an automated pipeline for text-guided edge-case synthesis. Our approach employs a Large Language Model, fine-tuned via preference learning, to rephrase image captions into diverse textual prompts that steer a Text-to-Image model toward generating difficult visual scenarios. Evaluated on the FishEye8K object detection benchmark, our method achieves superior robustness, surpassing both naive augmentation and manually engineered prompts. This work establishes a scalable framework that shifts data curation from manual effort to automated, targeted synthesis, offering a promising direction for developing more reliable and continuously improving AI systems. Code is available at https://github.com/gokyeongryeol/ATES.
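
To make the preference-learning step more concrete, the following is a minimal sketch of the standard Direct Preference Optimization loss applied to a pair of candidate prompts ranked by an edge-ness score. It is not taken from the authors' code: the function names `make_preference_pair` and `dpo_loss`, the pairing rule (highest score as chosen, lowest as rejected), the value of `beta`, and the treatment of edge-ness as a precomputed number are all assumptions made for illustration. How the paper actually computes edge-ness and forms pairs is not stated in this summary.

```python
import math
from typing import List, Tuple


def make_preference_pair(candidates: List[Tuple[str, float]]) -> Tuple[str, str]:
    """Pick (chosen, rejected) prompts from (prompt, edge_score) candidates.

    Assumption: a higher edge score means the generated image was a harder
    case, so the highest-scoring prompt is preferred over the lowest-scoring.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidate prompts to form a pair")
    chosen = max(candidates, key=lambda c: c[1])[0]
    rejected = min(candidates, key=lambda c: c[1])[0]
    return chosen, rejected


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical value, not from the paper
) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected prompts
    under the policy being tuned and under the frozen reference model.
    """
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)) equals softplus(-margin); computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Usage with made-up prompts and scores.
pair = make_preference_pair(
    [("a street at noon", 0.2), ("a foggy street at night", 0.9)]
)
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy shifts probability toward the chosen prompt relative to the reference.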

Paper Structure

This paper contains 46 sections, 2 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Training pipeline of the rephrasing LLM. Best viewed in color.
  • Figure 2: Data augmentation pipeline by inference of the preference-tuned LLM.
  • Figure 3: Number of images per camera ID and time-of-day in FishEye8K. The dataset is split into train-D, train-R, and test sets based on camera IDs, intentionally creating different levels of bias across splits.
  • Figure 4: Filtered ground-truth annotations and predictions to compute mAP w/o TP.
  • Figure 5: Comparison of the naive, manual, and automatic approaches via UMAP (McInnes et al., 2018) projections of CLIP (Radford et al., 2021) embeddings. Gray points represent real data, while colored points indicate synthetic data, with colors closer to yellow denoting higher density.
  • ...and 1 more figure