Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields

Taewoo Kim, Yujeong Chae, Hyun-Kurl Jang, Kuk-Jin Yoon

TL;DR

The paper tackles video frame interpolation by combining cross-modal information from event streams and RGB frames to handle complex, non-linear motion. It introduces EIF-BiOFNet, which directly estimates the asymmetric inter-frame motion fields $V_{t \rightarrow 0}$ and $V_{t \rightarrow 1}$ from both modalities, and an Interactive Attention-based Frame Synthesis network, which fuses warping-based and synthesis-based features to reconstruct the intermediate frame $I_t$. It also introduces ERF-X170FPS, a high-frame-rate dataset captured with a beam-splitter rig to cover extreme motions and dynamic textures. Across synthetic and real benchmarks, the proposed method reports state-of-the-art PSNR/SSIM gains (e.g., up to ~8.2 dB PSNR on GoPro and ~7.9 dB over TimeLens on ERF-X170FPS) with competitive model efficiency, demonstrating the value of cross-modal motion-field estimation and transformer-based frame synthesis for event-based VFI.
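To put the quoted dB gains in perspective, the sketch below uses the standard PSNR definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, to convert a PSNR gain into the factor by which the mean squared error shrinks. The baseline error value is hypothetical and chosen only for illustration; it is not taken from the paper.

```python
import math


def psnr(mse, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE shrinks when PSNR rises by gain_db."""
    return 10.0 ** (gain_db / 10.0)


# Hypothetical baseline error (assumption for illustration, not a paper result).
baseline_mse = 1e-3
improved_mse = baseline_mse / mse_ratio_for_gain(8.2)

# The PSNR difference recovers the gain we started from.
gain = psnr(improved_mse) - psnr(baseline_mse)
print(f"MSE shrinks by a factor of {mse_ratio_for_gain(8.2):.2f} for a {gain:.1f} dB gain")
```

Under this definition, a gain of about 8.2 dB corresponds to a mean squared error roughly 6.6 times smaller, independent of the baseline chosen.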

Abstract

Video Frame Interpolation (VFI) aims to generate intermediate video frames between consecutive input frames. Since event cameras are bio-inspired sensors that encode only brightness changes with microsecond temporal resolution, several works have used event cameras to enhance the performance of VFI. However, existing methods estimate bidirectional inter-frame motion fields with only events or with approximations, and therefore cannot account for the complex motion in real-world scenarios. In this paper, we propose a novel event-based VFI framework with cross-modal asymmetric bidirectional motion field estimation. In detail, our EIF-BiOFNet exploits the complementary characteristics of events and images to estimate inter-frame motion fields directly, without any approximation methods. Moreover, we develop an interactive attention-based frame synthesis network to efficiently leverage the complementary warping-based and synthesis-based features. Finally, we build a large-scale event-based VFI dataset, ERF-X170FPS, with a high frame rate, extreme motion, and dynamic textures to overcome the limitations of previous event-based VFI datasets. Extensive experimental results validate that our method shows significant performance improvement over state-of-the-art VFI methods on various datasets. Our project page is available at: https://github.com/intelpro/CBMNet
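To illustrate why approximated motion fields struggle with non-linear motion, here is a minimal 1-D sketch. It is not the paper's network: the trajectory, the helper names, and the linear-motion baseline (scaling the frame-0-to-frame-1 displacement by $t$) are assumptions chosen for illustration. An object accelerates from rest, and we compare the true asymmetric fields $V_{t \rightarrow 0}$ and $V_{t \rightarrow 1}$ with the linear approximation by backward-warping both input frames to time $t$.

```python
def position(tau, x0=10.0, accel=8.0):
    """Hypothetical 1-D trajectory: the object accelerates from rest."""
    return x0 + accel * tau ** 2


def true_fields(t):
    """Asymmetric displacements from time t to frame 0 and to frame 1."""
    x_t = position(t)
    return position(0.0) - x_t, position(1.0) - x_t


def linear_fields(t):
    """Linear-motion approximation: scale the 0->1 displacement by t."""
    v01 = position(1.0) - position(0.0)
    return -t * v01, (1.0 - t) * v01


def backward_warp_1d(signal, flow):
    """Sample signal at x + flow[x] with linear interpolation, clamped at borders."""
    last = len(signal) - 1
    out = []
    for x, v in enumerate(flow):
        src = min(max(x + v, 0.0), float(last))
        lo = int(src)
        hi = min(lo + 1, last)
        w = src - lo
        out.append((1.0 - w) * signal[lo] + w * signal[hi])
    return out


def peak(signal):
    """Index of the brightest sample."""
    return signal.index(max(signal))


n = 24
frame0 = [0.0] * n
frame0[10] = 1.0          # object at x = 10 in frame 0
frame1 = [0.0] * n
frame1[18] = 1.0          # object at x = 18 in frame 1

t = 0.5
v_t0, v_t1 = true_fields(t)      # (-2.0, 6.0): magnitudes differ
a_t0, a_t1 = linear_fields(t)    # (-4.0, 4.0): symmetric at t = 0.5

# Uniform flow: the whole 1-D scene translates with the object.
true_peaks = (peak(backward_warp_1d(frame0, [v_t0] * n)),
              peak(backward_warp_1d(frame1, [v_t1] * n)))
linear_peaks = (peak(backward_warp_1d(frame0, [a_t0] * n)),
                peak(backward_warp_1d(frame1, [a_t1] * n)))

print("true position at t:", position(t))
print("peaks with asymmetric fields:", true_peaks)
print("peaks with linear approximation:", linear_peaks)
```

In this toy setting the asymmetric fields place the object at its true intermediate position in both warped frames, whereas the linear approximation produces two warped frames that agree with each other but put the object in the wrong place.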

Paper Structure

This paper contains 29 sections, 4 equations, 17 figures, 11 tables.

Figures (17)

  • Figure 1: Qualitative comparison of frames warped with the estimated inter-frame motion fields. (b) and (c) estimate symmetric inter-frame motion fields. (d) and (e) estimate asymmetric motion fields using only images and only events, respectively. (f) Ours shows the best results using cross-modal asymmetric bidirectional motion fields.
  • Figure 2: The overall architecture of the Interactive Attention-based Frame Synthesis network.
  • Figure 3: The network architecture of the proposed EIF-BiOFNet at scale $s$. For brevity, we depict only one of the bidirectional motion fields in the F-BiOF module.
  • Figure 4: The proposed Interactive Attention Module. ⓒ denotes the concatenation operation along the channel dimension.
  • Figure 5: Visual results on the proposed ERF-X170FPS dataset. (Best viewed when zoomed in.)
  • ...and 12 more figures