FlowNet: Learning Optical Flow with Convolutional Networks

Philipp Fischer, Alexey Dosovitskiy, Eddy Ilg, Philip Häusser, Caner Hazırbaş, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, Thomas Brox

TL;DR

The paper demonstrates that convolutional neural networks can be trained end-to-end to predict dense optical flow directly from image pairs. It introduces two architectures: FlowNetS, a generic network, and FlowNetC, which adds a correlation layer that matches feature vectors across the two images. A synthetic Flying Chairs dataset, coupled with online data augmentation, supplies enough training data, and the trained networks generalize surprisingly well to existing datasets such as Sintel and KITTI, sometimes outperforming traditional real-time methods. Refinement by upconvolution, with optional variational post-processing, improves smoothness and accuracy, and the GPU implementation runs at 5 to 10 fps. The work highlights the potential of learned motion representations to rival hand-crafted optical-flow methods and points to future gains from more realistic training data and better handling of large displacements.

Abstract

Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks where CNNs were successful. In this paper we construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations. Since existing ground truth data sets are not sufficiently large to train a CNN, we generate a synthetic Flying Chairs dataset. We show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI, achieving competitive accuracy at frame rates of 5 to 10 fps.
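
To make "a layer that correlates feature vectors at different image locations" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation. It compares each feature vector in the first map with the feature vectors in the second map inside a limited displacement window, using a plain dot product (a single-pixel patch). The function name `correlation`, the parameter `max_disp`, and the `(H, W, C)` array layout are hypothetical choices made for this example.

```python
import numpy as np


def correlation(f1, f2, max_disp):
    """Dot-product correlation between two feature maps.

    f1, f2   : arrays of shape (H, W, C), one feature vector per location.
    max_disp : largest displacement (in feature-map pixels) to compare;
               an illustrative parameter, not a value from the paper.

    Returns an array of shape (H, W, D, D) with D = 2 * max_disp + 1, where
    out[y, x, i, j] = <f1[y, x], f2[y + dy, x + dx]> for
    dy = i - max_disp and dx = j - max_disp.
    """
    h, w, _ = f1.shape
    d = 2 * max_disp + 1
    # Zero-pad the second map so that shifted lookups stay in bounds.
    f2_pad = np.pad(f2, ((max_disp, max_disp), (max_disp, max_disp), (0, 0)))
    out = np.zeros((h, w, d, d), dtype=f1.dtype)
    for i, dy in enumerate(range(-max_disp, max_disp + 1)):
        for j, dx in enumerate(range(-max_disp, max_disp + 1)):
            # shifted[y, x] equals f2[y + dy, x + dx] (zero outside the map).
            shifted = f2_pad[
                max_disp + dy : max_disp + dy + h,
                max_disp + dx : max_disp + dx + w,
            ]
            out[:, :, i, j] = np.sum(f1 * shifted, axis=-1)
    return out
```

Restricting the comparison to a window of `max_disp` keeps the output at D x D values per location instead of one value for every pair of locations. With unit-norm features, the displacement with the highest response at a pixel is the one where the two feature vectors agree best.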

Paper Structure

This paper contains 26 sections, 2 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: We present neural networks which learn to estimate optical flow, being trained end-to-end. The information is first spatially compressed in a contractive part of the network and then refined in an expanding part.
  • Figure 1: Flow field color coding. The central pixel does not move, and the displacement of every other pixel is the vector from the center to this pixel.
  • Figure 2: The two network architectures: FlowNetSimple (top) and FlowNetCorr (bottom).
  • Figure 2: Histogram of displacement distribution in Sintel (left) and Flying Chairs (right) with linear (top) and logarithmic (bottom) y-axis. The distribution was cut off at a displacement of 150 pixels; the maximum flow in Sintel is actually around 450 pixels.
  • Figure 3: Refinement of the coarse feature maps to the high resolution prediction.
  • ...and 7 more figures
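
The flow field color coding caption above describes a legend in which the central pixel does not move and every other pixel's displacement is the vector from the center to that pixel. The sketch below builds such a legend field and maps a flow vector to a color. The mapping (hue from direction, saturation from magnitude, via HSV) is an assumed, commonly used simplification and is not necessarily the exact color wheel used in the paper. The names `center_flow_field` and `flow_to_rgb` are hypothetical.

```python
import colorsys
import math


def center_flow_field(size):
    """Legend flow field: flow[y][x] is the vector (u, v) from the central
    pixel to pixel (x, y), so the central pixel has zero displacement."""
    c = size // 2
    return [[(x - c, y - c) for x in range(size)] for y in range(size)]


def flow_to_rgb(u, v, max_mag):
    """Map one flow vector to an RGB triple with components in [0, 1].

    Assumed scheme: hue encodes direction, saturation encodes magnitude
    (clipped at max_mag), so zero motion maps to white.
    """
    hue = (math.atan2(v, u) / (2 * math.pi)) % 1.0
    sat = min(math.hypot(u, v) / max_mag, 1.0) if max_mag > 0 else 0.0
    return colorsys.hsv_to_rgb(hue, sat, 1.0)
```

Applying `flow_to_rgb` to every entry of `center_flow_field(size)` yields a legend image: white at the center, with colors that saturate toward the border and change hue with direction.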