Transfer learning with generative models for object detection on limited datasets

Matteo Paiano, Stefano Martina, Carlotta Giannelli, Filippo Caruso

TL;DR

This work proposes a transfer learning framework that mitigates the labor-intensive task of manually labeling images for object detection, and finds that it is not necessary to fine-tune the generative model on the specific domain of interest.

Abstract

The availability of data is limited in some fields, especially for object detection tasks, where it is necessary to have correctly labeled bounding boxes around each object. A notable example of such data scarcity is found in the domain of marine biology, where it is useful to develop methods to automatically detect submarine species for environmental monitoring. To address this data limitation, state-of-the-art machine learning strategies employ two main approaches. The first involves pretraining models on existing datasets before generalizing to the specific domain of interest. The second is to create synthetic datasets specifically tailored to the target domain using methods like copy-paste techniques or ad-hoc simulators. The first strategy often faces a significant domain shift, while the second demands custom solutions crafted for the specific task. In response to these challenges, we propose a transfer learning framework that is valid for a generic scenario. In this framework, generated images help to improve the performance of an object detector in a regime with few real data. This is achieved through a diffusion-based generative model that was pretrained on large generic datasets. In contrast to the state of the art, we find that it is not necessary to fine-tune the generative model on the specific domain of interest. We believe that this is an important advance because it mitigates the labor-intensive task of manually labeling images for object detection. We validate our approach on fishes in an underwater environment and on the more common domain of cars in an urban setting. Our method achieves detection performance comparable to models trained on thousands of images, using only a few hundred input images. Our results pave the way for new generative AI-based protocols for machine learning applications in various domains.

Paper Structure (13 sections, 5 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Transfer learning for object detection with generative models. We employ a pretrained layout-to-image (L2I) model to generate images for transfer learning to an object detector. We can filter out suboptimal generated images based on benchmark metrics. For instance, the image along the red arrow is discarded because the generative model has depicted many cars outside the bounding boxes designated in the grounding instruction. With the remaining generated images, we pretrain the object detector and then fine-tune it on the real dataset. Dashed lines indicate the data used for training the models.
  • Figure 2: We calculate the intersection over union (IoU) as defined in (a) to evaluate the overlap (b) between the ground truth (in green) and the predicted boxes (in red). This is used to implement a Precision-Recall filter to automatically identify faithful representations (c) or discrepancies (d, e) between the intended ground truth for generation and the actual generated images.
  • Figure 3: Fish instances are positioned in identical bounding boxes using three different methods: using as the GLIGEN grounding entity either the text phrase "a fish" (a) or an image of a real fish, as shown in the box in the top right corner of (b); or pasting DeepFish masks onto an OzFish background (c).
  • Figure 4: Object detection mAP evaluated on the same NuImages test set, illustrating the impact of pretraining with varying numbers of generated images (x-axis) and subsequent training with different quantities of real images from the NuImages training set (curve color and marker shape). Solid lines with filled markers represent results with pretraining on filtered images, while dashed lines with empty markers depict pretraining on unfiltered images (in the right part of the legend we report only one marker shape for compactness).
  • Figure 5: Object detection mAP on the real OzFish test set, for models pretrained on different quantities of unfiltered generated images (x-axis) and fine-tuned on varying numbers of OzFish training images (specified by the curve color). Dotted curves with round markers indicate models pretrained on synthetic copy-paste images. Dot-dashed and dashed lines, with square and diamond markers, represent the use of images and text as grounding entities, respectively. The solid cyan constant line at around $0.6$ mAP reports for reference the performance of a model pretrained on COCO and fine-tuned on all 1500 OzFish training images. The colored solid lines with triangular markers are references that use a standard data augmentation approach.
  • ...and 2 more figures
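
The IoU-based Precision-Recall filter mentioned in the captions of Figures 1 and 2 can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes axis-aligned boxes in (x_min, y_min, x_max, y_max) format, predicted boxes obtained from some detector run on the generated image, and a greedy one-to-one matching. The function names and the thresholds (0.5 for IoU, 0.8 for precision and recall) are hypothetical values chosen only for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(
    intended: List[Box], predicted: List[Box], iou_thr: float = 0.5
) -> Tuple[float, float]:
    """Greedily match each predicted box to at most one intended box."""
    unmatched = list(intended)
    true_positives = 0
    for p in predicted:
        # Pick the still-unmatched intended box that overlaps p the most.
        best = max(unmatched, key=lambda g: iou(g, p), default=None)
        if best is not None and iou(best, p) >= iou_thr:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(intended) if intended else 0.0
    return precision, recall


def keep_image(
    intended: List[Box],
    predicted: List[Box],
    min_precision: float = 0.8,  # hypothetical threshold
    min_recall: float = 0.8,     # hypothetical threshold
    iou_thr: float = 0.5,        # hypothetical threshold
) -> bool:
    """Keep a generated image only if it is faithful to the requested layout."""
    precision, recall = precision_recall(intended, predicted, iou_thr)
    return precision >= min_precision and recall >= min_recall


# One object was requested, but the detector finds a second one elsewhere:
# recall is perfect, precision drops to 0.5, so the image is discarded.
requested = [(0.0, 0.0, 10.0, 10.0)]
found = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
print(precision_recall(requested, found), keep_image(requested, found))
```

In this sketch, low precision corresponds to extra objects drawn outside the requested boxes (as for the discarded image in Figure 1), while low recall corresponds to requested objects that were not generated.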