On Pre-Trained Image Features and Synthetic Images for Deep Learning

Stefan Hinterstoisser, Vincent Lepetit, Paul Wohlhart, Kurt Konolige

TL;DR

The paper addresses the cost of manually labeling real images for training deep object detectors by training on synthetic images instead. It shows that freezing the feature extractor layers pre-trained on real images, and training only the remaining detector layers on synthetic images, yields performance close to that of models trained on real data, across Faster-RCNN, R-FCN, and Mask-RCNN with InceptionResnet and ResNet101 feature extractors. The authors use a minimal synthetic data pipeline based on plain OpenGL rendering with simple scene variations, and report that the approach holds up across objects, cameras, and architectures, while subsequently fine-tuning the frozen layers brings only limited gains. The results indicate that photorealistic rendering is not strictly necessary for competitive detection, which reduces labeling cost and makes training from synthetic data alone practical.

Abstract

Deep Learning methods usually require huge amounts of training data to perform at their full potential, and often require expensive manual labeling. Using synthetic images is therefore very attractive to train object detectors, as the labeling comes for free, and several approaches have been proposed to combine synthetic and real images for training. In this paper, we show that a simple trick is sufficient to train modern object detectors very effectively with synthetic images only: we freeze the layers responsible for feature extraction to generic layers pre-trained on real images, and train only the remaining layers with plain OpenGL rendering. Our experiments with very recent deep architectures for object recognition (Faster-RCNN, R-FCN, Mask-RCNN) and image feature extractors (InceptionResnet and Resnet) show that this simple approach performs surprisingly well.
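
To make the freezing idea concrete, here is a minimal sketch in plain Python. It is not the authors' code and does not use any detection framework: the "feature extractor" is a hypothetical set of fixed ReLU units standing in for layers pre-trained on real images, the "head" is a small linear layer standing in for the remaining detector layers, and the "synthetic" data is a toy regression set. All names (`features`, `train_head_only`, `extractor`, `head`) are assumptions made for illustration. The only point it demonstrates is that the training loop updates the head and never writes to the extractor.

```python
import copy
import random


def features(x, extractor):
    """Frozen stand-in for a pre-trained feature extractor: fixed ReLU units."""
    return [max(0.0, w * x + b) for (w, b) in extractor]


def predict(x, extractor, head):
    """Linear head applied on top of the frozen features."""
    f = features(x, extractor)
    return sum(hw * fi for hw, fi in zip(head["w"], f)) + head["b"]


def mse(data, extractor, head):
    """Mean squared error of the full model over a dataset."""
    return sum((predict(x, extractor, head) - y) ** 2 for x, y in data) / len(data)


def train_head_only(data, extractor, head, lr=0.05, epochs=200):
    """Plain SGD that updates only the head; the extractor is read, never written."""
    for _ in range(epochs):
        for x, y in data:
            f = features(x, extractor)
            err = sum(hw * fi for hw, fi in zip(head["w"], f)) + head["b"] - y
            # Gradient of the squared error with respect to the head parameters only.
            head["w"] = [hw - lr * 2.0 * err * fi for hw, fi in zip(head["w"], f)]
            head["b"] -= lr * 2.0 * err
    return head


rng = random.Random(0)

# Hypothetical "pre-trained" extractor parameters (weight, bias) per unit.
extractor = [(1.0, 0.0), (-1.0, 0.0), (1.0, -0.5), (-1.0, -0.5)]
extractor_before = copy.deepcopy(extractor)

# Toy "synthetic" training set: inputs in [-1, 1] with target |x|.
synthetic = [(x, abs(x)) for x in (rng.uniform(-1.0, 1.0) for _ in range(64))]

head = {"w": [0.0] * len(extractor), "b": 0.0}
loss_before = mse(synthetic, extractor, head)
train_head_only(synthetic, extractor, head)
loss_after = mse(synthetic, extractor, head)

print("extractor unchanged:", extractor == extractor_before)
print("loss decreased:", loss_after < loss_before)
```

In a real detector the same pattern would typically be expressed by excluding the feature extractor's variables from the optimizer, so that gradients flow through those layers but never modify them.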

Paper Structure

This paper contains 15 sections, 13 figures, 1 table.

Figures (13)

  • Figure 1: We show that feature extractor layers from modern object detectors pre-trained on real images can be used on synthetic images to learn to detect objects in real images. The top-left image shows the CAD model we used to learn to detect the object in the three other images.
  • Figure 2: The architectures of two recent object detectors with their feature extractors isolated as described in Huang17 (Figure taken from Huang17).
  • Figure 3: Our synthetic data generation pipeline. For each generated 3D pose and object, we render the object over a randomly selected cluttered background image using OpenGL and the Phong illumination model Phong75. We use randomly perturbed light color for rendering and add image noise to the rendering. Finally, we blur the object with a Gaussian filter. We also compute a tightly fitting bounding box using the object's CAD model and the corresponding pose.
  • Figure 4: (a) The real objects used in our experiments and (b) their CAD models. We chose our objects carefully to represent different colors and 3D shapes and to cover different fields of applications (industrial objects, household objects, toys).
  • Figure 5: The effect of freezing the pre-trained feature extractor, for two different cameras. Training the feature extractors on synthetic images performs poorly, and totally fails in the case of the AsusXtionPROLive camera. When using feature extractors pre-trained on real images without retraining them, the performance of detectors trained on synthetic data is almost as good as when training them on real data, except when ResNet101 is used with images from the AsusXtionPROLive camera.
  • ...and 8 more figures
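
The generation steps listed in the Figure 3 caption can be sketched roughly as follows, assuming NumPy is available. This is an illustration under strong assumptions, not the authors' pipeline: `render_stub` is a hypothetical placeholder that draws a flat-colored disc instead of rendering a CAD model with OpenGL and Phong shading, the background is random noise rather than a cluttered photograph, noise is applied to the rendered object only, and the bounding box is taken from the object mask rather than from the CAD model and pose. The parameter values (`noise_std`, `blur_sigma`, the light color range) are arbitrary choices for the example.

```python
import numpy as np


def gaussian_kernel1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def blur(img, sigma=1.0):
    """Separable Gaussian blur over the two spatial axes of an (H, W, 3) image."""
    k = gaussian_kernel1d(sigma, max(1, int(3 * sigma)))
    out = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 0, img)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), 1, out)


def render_stub(h, w, rng):
    """Hypothetical stand-in for the OpenGL render: a disc at a random position."""
    cy = rng.integers(h // 4, 3 * h // 4)
    cx = rng.integers(w // 4, 3 * w // 4)
    r = rng.integers(5, min(h, w) // 6)
    yy, xx = np.mgrid[0:h, 0:w]
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
    rgb = np.zeros((h, w, 3))
    rgb[mask] = (0.8, 0.3, 0.2)
    return rgb, mask


def synthesize(background, rng, noise_std=0.02, blur_sigma=1.0):
    """Compose one training image and its tight bounding box (x0, y0, x1, y1)."""
    h, w, _ = background.shape
    obj, mask = render_stub(h, w, rng)
    obj = obj * rng.uniform(0.8, 1.2, size=3)                # perturbed light color
    obj = obj + rng.normal(0.0, noise_std, size=obj.shape)   # image noise on the render
    img = np.where(mask[..., None], obj, background)         # paste over the background
    img = np.where(mask[..., None], blur(img, blur_sigma), img)  # blur the object region
    ys, xs = np.nonzero(mask)
    bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    return np.clip(img, 0.0, 1.0), bbox, mask


rng = np.random.default_rng(0)
background = rng.uniform(0.0, 1.0, size=(64, 64, 3))  # stand-in for a cluttered photo
image, bbox, mask = synthesize(background, rng)
print(image.shape, bbox)
```

Because the label is computed from the same geometry that produced the image, every generated sample comes with an exact bounding box at no annotation cost, which is the property the paper relies on.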