MemFlow: Optical Flow Estimation and Prediction with Memory

Qiaole Dong, Yanwei Fu

TL;DR

MemFlow is a memory-augmented online optical flow framework that reads and updates a history-aware memory to leverage temporal coherence without requiring offline multi-frame input. It introduces attention-based memory read-out and resolution-adaptive re-scaling, enabling strong cross-dataset generalization and real-time performance, and it extends to one-step-ahead flow prediction (MemFlow-P) for video synthesis workflows. The approach achieves state-of-the-art or near-SOTA results with fewer parameters and faster inference than heavy multi-frame methods, and demonstrates competitive flow prediction and video prediction without task-specific training. This work offers a practical, memory-driven solution for real-time optical flow and predictive motion modeling in safety-critical applications.

Abstract

Optical flow is a classical task that is important to the vision community. Classical optical flow estimation uses two frames as input, whilst some recent methods consider multiple frames to explicitly model long-range information. The former cannot fully leverage temporal coherence along the video sequence, and the latter incur heavy computational overhead that typically rules out real-time flow estimation. Some multi-frame-based approaches even require unseen future frames for the current estimate, compromising real-time applicability in safety-critical scenarios. To this end, we present MemFlow, a real-time method for optical flow estimation and prediction with memory. Our method uses memory read-out and update modules to aggregate historical motion information in real time. Furthermore, we integrate resolution-adaptive re-scaling to accommodate diverse video resolutions. Our approach also extends seamlessly to predicting future optical flow from past observations. Leveraging effective historical motion aggregation, our method outperforms VideoFlow with fewer parameters and faster inference speed on the Sintel and KITTI-15 datasets in terms of generalization performance. At the time of submission, MemFlow also leads in performance on the 1080p Spring dataset. Code and models will be available at: https://dqiaole.github.io/MemFlow/.
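
To make the memory read-out and update idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the buffer holds per-frame context features as attention keys and motion features as attention values, and that the current frame's own features are attended to alongside the stored ones. The class name `MemoryBuffer`, its method names, the buffer length, and the default `1/sqrt(D)` scale are all hypothetical; the abstract does not specify the resolution-adaptive re-scaling, so the `scale` argument is only a placeholder for it.

```python
from collections import deque

import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MemoryBuffer:
    """Hypothetical fixed-length memory of past context/motion features."""

    def __init__(self, max_frames=3):
        # Oldest entries are dropped automatically once the buffer is full.
        self.keys = deque(maxlen=max_frames)    # context features, each (N, D)
        self.values = deque(maxlen=max_frames)  # motion features, each (N, Dv)

    def update(self, context, motion):
        # Called when a new frame has been processed.
        self.keys.append(context)
        self.values.append(motion)

    def read_out(self, context, motion, scale=None):
        # Attend from the current context to stored + current features.
        keys = np.concatenate(list(self.keys) + [context], axis=0)     # (M, D)
        values = np.concatenate(list(self.values) + [motion], axis=0)  # (M, Dv)
        if scale is None:
            # Assumption: plain scaled dot-product attention.
            scale = 1.0 / np.sqrt(context.shape[1])
        attn = softmax(scale * (context @ keys.T), axis=-1)            # (N, M)
        return attn @ values                                           # (N, Dv)


rng = np.random.default_rng(0)
buf = MemoryBuffer(max_frames=2)
for _ in range(3):  # three past frames; only the last two are kept
    buf.update(rng.normal(size=(6, 4)), rng.normal(size=(6, 2)))
aggregated = buf.read_out(rng.normal(size=(6, 4)), rng.normal(size=(6, 2)))
print(aggregated.shape)  # (6, 2)
```

Because the buffer only ever holds past frames, a read-out of this kind never needs future frames, which is the online property the abstract emphasizes.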

Paper Structure

This paper contains 17 sections, 7 equations, 17 figures, and 9 tables.

Figures (17)

  • Figure 1: End-point-error on Sintel (clean) vs. inference time (ms) and model size (M). All models are trained on FlyingChairs and FlyingThings3D, and tested with one NVIDIA A100 GPU. MemFlow(-T) (x it) indicates running our network with only x iterations of GRU. Our MemFlow(-T) achieves significant reductions in computational overhead as well as substantial performance boosts over the state-of-the-art methods.
  • Figure 2: Overview of MemFlow. MemFlow maintains a memory buffer that stores historical motion states of the video, together with an efficient update and read-out process that retrieves useful motion information for the current frame's optical flow estimation. It has three key components: 1) Feature extractors. The feature and motion encoders extract and construct the motion feature for the current frame, and a separate context encoder produces the context feature. 2) Memory buffer. The memory buffer stores historical context and motion features and reads out the aggregated motion feature. 3) Update modules. A GRU updates the optical flow with a series of residual flows, and the memory buffer is updated whenever a new frame arrives.
  • Figure 3: End-point-error of optical flow vs. number of iterations during inference. This figure shows generalization performance on the Sintel (clean) training set. After only 2 iterations, our method outperforms SKFlow run with 15 iterations.
  • Figure 4: Qualitative comparison on the training set of the Sintel final pass after pre-training on FlyingChairs and FlyingThings3D. Notable areas are marked by a bounding box. Please zoom in for details.
  • Figure 5: Qualitative comparison on the test set of KITTI-15 after finetuning. Our method is much better at distinguishing between different vehicles (first row) and between the foreground and the sky (second row). Please zoom in for details.
  • ...and 12 more figures
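
Figures 1 and 3 report end-point error (EPE), the standard optical flow metric: the Euclidean distance between predicted and ground-truth flow vectors, averaged over pixels. The sketch below shows that computation in NumPy. The function name `end_point_error` and the optional `valid` mask argument are illustrative choices, not taken from the paper's code.

```python
import numpy as np


def end_point_error(flow_pred, flow_gt, valid=None):
    """Mean Euclidean distance between predicted and ground-truth flow.

    flow_pred, flow_gt: arrays of shape (H, W, 2) holding (u, v) per pixel.
    valid: optional boolean mask of shape (H, W); only True pixels count.
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)  # per-pixel error, (H, W)
    if valid is not None:
        err = err[valid]
    return float(err.mean())


# Every pixel is off by the vector (3, 4), so the error is 5 pixels everywhere.
pred = np.zeros((4, 5, 2))
gt = np.tile(np.array([3.0, 4.0]), (4, 5, 1))
print(end_point_error(pred, gt))  # 5.0
```

The mask is useful for datasets with sparse ground truth, where only annotated pixels should contribute to the average.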