BlinkVision: A Benchmark for Optical Flow, Scene Flow and Point Tracking Estimation using RGB Frames and Events

Yijin Li, Yichen Shen, Zhaoyang Huang, Shuo Chen, Weikang Bian, Xiaoyu Shi, Fu-Yun Wang, Keqiang Sun, Hujun Bao, Zhaopeng Cui, Guofeng Zhang, Hongsheng Li

TL;DR

BlinkVision addresses the lack of a unified benchmark for pixel-wise correspondence that combines RGB frames and event data across optical flow, scene flow, and point tracking. It introduces a photorealistic, multi-modality dataset with dense per-pixel ground truth over 410 categories, rendered with Blender and accompanied by a public leaderboard. The study shows that current image- and event-based methods struggle under large frame gaps and extreme lighting, though fine-tuning on BlinkVision improves generalization and highlights the dataset’s value for cross-modal research. By enabling category-aware analysis and cross-dataset transfer, BlinkVision promises to accelerate the development of robust, multi-modal vision systems.

Abstract

Recent advances in event-based vision suggest that these systems complement traditional cameras by providing continuous observation without frame rate limitations and a high dynamic range, making them well-suited for correspondence tasks such as optical flow and point tracking. However, there is still a lack of comprehensive benchmarks for correspondence tasks that include both event data and images. To address this gap, we propose BlinkVision, a large-scale and diverse benchmark with multiple modalities and dense correspondence annotations. BlinkVision offers several valuable features: 1) Rich modalities: It includes both event data and RGB images. 2) Extensive annotations: It provides dense per-pixel annotations covering optical flow, scene flow, and point tracking. 3) Large vocabulary: It contains 410 everyday categories, sharing common classes with popular 2D and 3D datasets like LVIS and ShapeNet. 4) Naturalistic: It delivers photorealistic data and covers various naturalistic factors, such as camera shake and deformation. BlinkVision enables extensive benchmarks on three types of correspondence tasks (optical flow, point tracking, and scene flow estimation) for both image-based and event-based methods, offering new observations, practices, and insights for future research. The benchmark website is https://www.blinkvision.net/.

Paper Structure

This paper contains 14 sections, 1 equation, 7 figures, 9 tables.

Figures (7)

  • Figure 1: BlinkVision is a large-scale and diverse benchmark with rich modality and dense annotation of correspondence. It covers 410 daily categories, sharing common classes with popular 2D and 3D datasets. The per-category object distributions, scene structure hierarchy, data samples, and supported applications of BlinkVision are shown in this figure.
  • Figure 2: Scene samples in the proposed BlinkVision benchmark.
  • Figure 3: Statistics of optical flow in BlinkVision.
  • Figure 4: Statistics of point trajectories in BlinkVision. To save time, we sample the tracks using a grid with a size of 20. Trajectory segments are defined as contiguous sections of point trajectories, with interruptions caused by occlusion. The diameter is the maximum distance a point moves over time. We clip the trajectory diameter and divide it by the diagonal length of the image to obtain the ratio.
  • Figure 5: Qualitative results of FlowFormer++ (Shi et al., 2023) before and after fine-tuning on the training set of BlinkVision.
  • ...and 2 more figures
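
The trajectory statistics described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' code. The function names `visible_segments` and `diameter_ratio` are hypothetical, and two details are assumptions rather than facts from the paper: the diameter is taken as the largest pairwise distance between a track's positions, and it is clipped to the image diagonal before normalization.

```python
import math
from itertools import combinations


def visible_segments(visible):
    """Split a per-frame visibility mask into contiguous visible runs.

    Returns (start, end) frame index pairs with `end` exclusive. Each run
    corresponds to one trajectory segment; occluded frames separate runs.
    """
    segments = []
    start = None
    for t, is_visible in enumerate(visible):
        if is_visible and start is None:
            start = t  # a new segment begins
        elif not is_visible and start is not None:
            segments.append((start, t))  # occlusion ends the segment
            start = None
    if start is not None:
        segments.append((start, len(visible)))
    return segments


def diameter_ratio(points, width, height):
    """Trajectory diameter as a fraction of the image diagonal.

    Assumption: diameter = largest pairwise distance between positions,
    clipped to the diagonal length so the ratio stays in [0, 1].
    """
    diagonal = math.hypot(width, height)
    diameter = max(
        (math.dist(p, q) for p, q in combinations(points, 2)),
        default=0.0,  # a single-point track has zero diameter
    )
    return min(diameter, diagonal) / diagonal


# Hypothetical track: visible in frames 0-1, occluded in frame 2, visible in frame 3.
mask = [True, True, False, True]
track = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(visible_segments(mask))
print(diameter_ratio(track, width=640, height=480))
```

The pairwise search is quadratic in track length, which is acceptable for a sketch; longer tracks would call for a convex-hull-based diameter instead.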