ImaginaryNet: Learning Object Detectors without Real Images and Annotations

Minheng Ni, Zitong Huang, Kailai Feng, Wangmeng Zuo

TL;DR

Imaginary-Supervised Object Detection (ISOD) is the task of training object detectors without real images or manual annotations; labeled training data is instead synthesized by a language-guided image generator. ImaginaryNet uses GPT-2 to write scene descriptions and a text-to-image model to render them as images. The resulting image-label pairs train a detector through weakly supervised object detection (WSOD), and they provide complementary gains when real data is available. On VOC2007 and MSCOCO, ImaginaryNet improves substantially over a CLIP baseline in the ISOD setting and approaches a significant fraction of WSOD performance, while joint training with real-data supervision yields state-of-the-art or comparable results. The work demonstrates a practical pathway for low-resource detection and dataset augmentation via language-guided synthetic imagery and weakly supervised learning.

Abstract

Without the need for training in reality, humans can easily detect a known concept simply based on its language description. Empowering deep learning with this ability would enable neural networks to handle complex vision tasks, e.g., object detection, without collecting and annotating real images. To this end, this paper introduces a novel and challenging learning paradigm, Imaginary-Supervised Object Detection (ISOD), where neither real images nor manual annotations are allowed for training object detectors. To resolve this challenge, we propose ImaginaryNet, a framework that synthesizes images by combining a pretrained language model and a text-to-image synthesis model. Given a class label, the language model is used to generate a full description of a scene with a target object, and the text-to-image model is deployed to generate a photo-realistic image. With the synthesized images and class labels, weakly supervised object detection can then be leveraged to accomplish ISOD. By gradually introducing real images and manual annotations, ImaginaryNet can collaborate with other supervision settings to further boost detection performance. Experiments show that ImaginaryNet can (i) obtain about 70% performance in ISOD compared with the weakly supervised counterpart of the same backbone trained on real data, and (ii) significantly improve the baseline while achieving state-of-the-art or comparable performance by incorporating ImaginaryNet with other supervision settings.

Paper Structure

This paper contains 33 sections, 3 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: Overview of ImaginaryNet. ImaginaryNet samples a class label at random and fills it into the prefix template. The language model extends the prefix to a complete description. The synthesis model generates imaginary images from random noise based on the description. Proposal representations are extracted from the imaginary images. ImaginaryNet optimizes the Detection Head with the proposal representations and class labels. If real data exists, the Detection Head is also optimized based on representations from real images and manual annotations.
  • Figure 2: The structure of the Detection Head. Depending on the data setting, we use three types of detection heads. In type (a), no real images or annotations participate in the training process. In types (b) and (c), real images and annotations are used for training together with imaginary data. In type (c), the MIL Branch is disabled because box supervision exists.
  • Figure 3: Visualization of extracted features. The features cluster by object type, which shows that imaginary data contains knowledge similar to that of real data.
  • Figure 4: Overall results for different total numbers of imaginary samples. Performance improves steadily as the number of imaginary samples grows.
  • Figure 5: Visualization of imaginary images. The language model extends the prefix to a full description of the scene, and the synthesis model generates images that follow the description.
  • ...and 1 more figures
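
The data-generation loop described in the Figure 1 caption can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the class list, the prefix template "A photo of a {label}", and the two stub functions standing in for the pretrained language model and the text-to-image model are all assumptions introduced here, and the "image" is a short list of random numbers rather than real pixels.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical class names and prefix template (not taken from the paper).
CLASSES = ["dog", "bicycle", "sofa"]
PREFIX_TEMPLATE = "A photo of a {label}"


@dataclass
class ImaginarySample:
    label: str          # class label, the only supervision kept for training
    description: str    # full scene description produced by the language model
    image: List[float]  # placeholder for the synthesized image


def stub_language_model(prefix: str, rng: random.Random) -> str:
    """Stand-in for a pretrained language model that extends a prefix."""
    continuations = [" in a park.", " on a city street.", " in a living room."]
    return prefix + rng.choice(continuations)


def stub_text_to_image(description: str, rng: random.Random) -> List[float]:
    """Stand-in for a text-to-image model: random noise in, 'image' out.

    A real model would condition on the description; this stub ignores it.
    """
    return [rng.random() for _ in range(4)]


def imagine(
    n: int,
    classes: List[str],
    seed: int = 0,
    extend: Callable[[str, random.Random], str] = stub_language_model,
    render: Callable[[str, random.Random], List[float]] = stub_text_to_image,
) -> List[ImaginarySample]:
    """Produce n (image, class label) pairs without touching real data."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        label = rng.choice(classes)                     # 1. sample a class label
        prefix = PREFIX_TEMPLATE.format(label=label)    # 2. fill the prefix template
        description = extend(prefix, rng)               # 3. extend to a full description
        image = render(description, rng)                # 4. synthesize an image
        samples.append(ImaginarySample(label, description, image))
    return samples


if __name__ == "__main__":
    for sample in imagine(3, CLASSES):
        print(sample.label, "|", sample.description)
```

Each sample carries only an image-level class label and no bounding box, which is why the caption pairs this data with a weakly supervised Detection Head. Passing the two models in as callables keeps the loop independent of any particular language or synthesis model.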