On Pre-Trained Image Features and Synthetic Images for Deep Learning
Stefan Hinterstoisser, Vincent Lepetit, Paul Wohlhart, Kurt Konolige
TL;DR
The paper addresses the problem of training deep object detectors without labor-intensive labeling of real images by using synthetic data instead. It shows that freezing the feature-extraction layers, pre-trained on real images, and training only the remaining detector layers on synthetically generated images yields performance close to that of models trained on real data. This holds across Faster-RCNN, R-FCN, and Mask-RCNN with InceptionResnet and ResNet101 feature extractors. The authors describe a minimal synthetic data pipeline based on plain OpenGL rendering and simple scene variations, and report that the approach generalizes across objects, cameras, and architectures, while fine-tuning the frozen layers brings only limited gains. The work reduces labeling costs and indicates that realistic rendering is not strictly necessary for competitive detection, which makes training from synthetic data practical at scale.
Abstract
Deep Learning methods usually require huge amounts of training data to perform at their full potential, and often require expensive manual labeling. Using synthetic images is therefore very attractive to train object detectors, as the labeling comes for free, and several approaches have been proposed to combine synthetic and real images for training. In this paper, we show that a simple trick is sufficient to train modern object detectors very effectively with synthetic images only: We freeze the layers responsible for feature extraction to generic layers pre-trained on real images, and train only the remaining layers with plain OpenGL rendering. Our experiments with very recent deep architectures for object recognition (Faster-RCNN, R-FCN, Mask-RCNN) and image feature extractors (InceptionResnet and ResNet) show that this simple approach performs surprisingly well.
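The freezing trick described in the abstract can be illustrated with a toy sketch in plain Python. This is not the authors' pipeline: the fixed ReLU layer `FROZEN_W` is a hypothetical stand-in for a feature extractor pre-trained on real images, the linear `head` stands in for the detector layers, and the tiny `synthetic_data` set stands in for rendered images whose labels come for free. All names and values are assumptions made for illustration.

```python
# Toy illustration only: a fixed ReLU layer stands in for a feature extractor
# pre-trained on real images; a linear head stands in for the detector layers.

FROZEN_W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # hypothetical "pre-trained" weights


def extract_features(x):
    """Frozen feature extractor: fixed linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def predict(head, x):
    """Apply the trainable linear head on top of the frozen features."""
    return sum(h * f for h, f in zip(head, extract_features(x)))


def mean_squared_error(head, data):
    return sum((predict(head, x) - y) ** 2 for x, y in data) / len(data)


def train_head(head, data, lr=0.1, epochs=200):
    """Stochastic gradient descent on the head only; FROZEN_W is never updated."""
    head = list(head)
    for _ in range(epochs):
        for x, y in data:
            feats = extract_features(x)
            err = sum(h * f for h, f in zip(head, feats)) - y
            head = [h - lr * err * f for h, f in zip(head, feats)]
    return head


# Hypothetical "synthetic" training set: labels are produced by a known generator.
TRUE_HEAD = [2.0, -1.0, 0.5]
inputs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.2)]
synthetic_data = [(x, predict(TRUE_HEAD, x)) for x in inputs]

frozen_before = [row[:] for row in FROZEN_W]  # snapshot to check the freeze
initial_head = [0.0, 0.0, 0.0]
initial_loss = mean_squared_error(initial_head, synthetic_data)
trained_head = train_head(initial_head, synthetic_data)
final_loss = mean_squared_error(trained_head, synthetic_data)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

In a deep learning framework the same idea would typically amount to excluding the feature-extractor parameters from the optimizer's update, so that only the remaining layers adapt to the synthetic images.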
