EV-FlowNet: Self-Supervised Optical Flow Estimation for Event-based Cameras

Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, Kostas Daniilidis

TL;DR

EV-FlowNet is a self-supervised method for estimating optical flow from event-based cameras. It converts the asynchronous event stream into a fixed four-channel image and uses synchronized grayscale frames from the same camera as the supervisory signal. An encoder-decoder CNN predicts dense optical flow and is trained with a photometric loss and a smoothness prior, so no ground-truth flow annotations are required. A new dataset derived from MVSEC enables evaluation of event-based optical flow; on it, the method is competitive with frame-based self-supervised methods such as UnFlow and remains robust across different scenes. The image-based event representation also offers a way to transfer self-supervised learning techniques from frames to event data, and the authors outline future work on stronger event-only supervision and broader datasets.
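
The conversion of events into a four-channel image can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the four channels are the per-pixel counts of positive and negative events plus the most recent timestamp of each polarity, and it assumes timestamps are normalized to [0, 1] over the window. The function name `events_to_image` and the `(x, y, t, polarity)` tuple layout are hypothetical.

```python
def events_to_image(events, height, width):
    """Accumulate events (x, y, t, polarity) into a 4-channel image.

    Assumed channel layout: 0 = positive count, 1 = negative count,
    2 = latest positive timestamp, 3 = latest negative timestamp.
    """
    image = [[[0.0] * width for _ in range(height)] for _ in range(4)]
    if not events:
        return image
    t_start = min(e[2] for e in events)
    t_span = max(e[2] for e in events) - t_start
    for x, y, t, polarity in events:
        count_ch, time_ch = (0, 2) if polarity > 0 else (1, 3)
        image[count_ch][y][x] += 1.0
        # Normalize the timestamp to [0, 1] within the window (assumption).
        t_norm = (t - t_start) / t_span if t_span > 0 else 0.0
        # Keep only the most recent timestamp; older values are overwritten.
        image[time_ch][y][x] = max(image[time_ch][y][x], t_norm)
    return image


# Four events: three at pixel (x=1, y=0), one at pixel (x=2, y=1).
events = [(1, 0, 0.00, 1), (1, 0, 0.02, 1), (2, 1, 0.01, -1), (1, 0, 0.04, -1)]
img = events_to_image(events, height=2, width=3)
```

Because each pixel stores only its latest timestamp, the image has a fixed size regardless of how many events arrive, at the cost of discarding older events at the same pixel.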

Abstract

Event-based cameras have shown great promise in a variety of situations where frame-based cameras suffer, such as high-speed motions and high dynamic range scenes. However, developing algorithms for event measurements requires a new class of hand-crafted algorithms. Deep learning has shown great success in providing model-free solutions to many problems in the vision community, but existing networks have been developed with frame-based images in mind, and there does not exist the wealth of labeled data for events as there does for images for supervised training. To these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras. In particular, we introduce an image-based representation of a given event stream, which is fed into a self-supervised neural network as the sole input. The corresponding grayscale images captured from the same camera at the same time as the events are then used as a supervisory signal to provide a loss function at training time, given the estimated flow from the network. We show that the resulting network is able to accurately predict optical flow from events only in a variety of different scenes, with performance competitive to image-based networks. This method not only allows for accurate estimation of dense optical flow, but also provides a framework for the transfer of other self-supervised methods to the event-based domain.
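
The supervisory signal described in the abstract can be illustrated with a small sketch: the later grayscale image is warped back toward the earlier one using the predicted flow, and the remaining intensity difference is penalized together with a smoothness term on the flow. The Charbonnier penalty, the bilinear sampling with border clamping, the weight `lam`, and all function names here are assumptions made for illustration, not the authors' implementation.

```python
import math


def charbonnier(x, eps=1e-3):
    """Robust penalty, a smooth approximation of |x|."""
    return math.sqrt(x * x + eps * eps)


def bilinear_sample(image, x, y):
    """Sample image (list of rows) at real-valued (x, y), clamped to the border."""
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * image[y0][x0] + ax * image[y0][x1]
    bottom = (1 - ax) * image[y1][x0] + ax * image[y1][x1]
    return (1 - ay) * top + ay * bottom


def self_supervised_loss(img0, img1, flow_u, flow_v, lam=0.5):
    """Photometric loss plus a weighted smoothness loss for one flow field."""
    h, w = len(img0), len(img0[0])
    photometric = 0.0
    smoothness = 0.0
    for y in range(h):
        for x in range(w):
            # Warp: look up img1 where the flow says pixel (x, y) moved to.
            warped = bilinear_sample(img1, x + flow_u[y][x], y + flow_v[y][x])
            photometric += charbonnier(img0[y][x] - warped)
            # Penalize flow differences with the right and lower neighbours.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < h and nx < w:
                    smoothness += charbonnier(flow_u[y][x] - flow_u[ny][nx])
                    smoothness += charbonnier(flow_v[y][x] - flow_v[ny][nx])
    return photometric + lam * smoothness


# A bright column moves one pixel to the right between the two frames.
img0 = [[0.0, 1.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
img1 = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
zeros = [[0.0] * 4 for _ in range(2)]
ones = [[1.0] * 4 for _ in range(2)]
loss_correct = self_supervised_loss(img0, img1, ones, zeros)  # flow = (+1, 0)
loss_zero = self_supervised_loss(img0, img1, zeros, zeros)    # flow = (0, 0)
```

In this toy case the correct flow yields a much smaller loss than zero flow, which is the property that lets grayscale frames stand in for ground-truth flow labels during training.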

Paper Structure

This paper contains 21 sections, 7 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Left: Event input to the network visualizing the last two channels (latest timestamps). Right: Predicted flow, colored by direction. Best viewed in color.
  • Figure 2: Example of a timestamp image. Left: Grayscale output. Right: Timestamp image, where each pixel represents the timestamp of the most recent event. Brighter is more recent.
  • Figure 3: EV-FlowNet architecture. The event input is downsampled through four encoder (strided convolution) layers, before being passed through two residual block layers. The activations are then passed through four decoder (upsample convolution) layers, with skip connections to the corresponding encoder layer. In addition, each set of decoder activations is passed through another depthwise convolution layer to generate a flow prediction at its resolution. A loss is applied to this flow prediction, and the prediction is also concatenated to the decoder activations. Best viewed in color.
  • Figure 4: Qualitative results from evaluation. Examples were collected from outdoor_day1, outdoor_day1, indoor_flying1 and indoor_flying2, in that order. Best viewed in color.
  • Figure 5: Common failure case, where fast motion causes recent timestamps to overwrite older pixels nearby, resulting in incorrect predictions. Best viewed in color.
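
The encoder-decoder layout described in the Figure 3 caption can be traced as a sequence of spatial resolutions. The sketch below assumes that each strided convolution halves the resolution and each upsample convolution doubles it; the factor of 2, the 256x256 input size, and the function name `trace_resolutions` are illustrative assumptions not stated in the caption.

```python
def trace_resolutions(size, num_encoders=4, num_decoders=4, factor=2):
    """Return (stage name, side length) pairs for a square input of side `size`."""
    stages = [("input", size)]
    for i in range(num_encoders):
        size //= factor  # strided convolution downsamples
        stages.append((f"encoder{i + 1}", size))
    # Residual blocks keep the resolution unchanged.
    stages.append(("residual blocks", size))
    for i in range(num_decoders):
        size *= factor  # upsample convolution restores resolution
        stages.append((f"decoder{i + 1} + flow prediction", size))
    return stages


stages = trace_resolutions(256)
for name, side in stages:
    print(f"{name}: {side}x{side}")
```

Under these assumptions each decoder stage shares its resolution with either an encoder stage or the input, and a flow prediction, with its own loss, is available at every decoder resolution.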