A Unified Solution to Video Fusion: From Multi-Frame Learning to Benchmarking

Zixiang Zhao, Haowen Bai, Bingxin Ke, Yukun Cui, Lilun Deng, Yulun Zhang, Kai Zhang, Konrad Schindler

TL;DR

This work introduces UniVF, a unified video fusion framework that exploits temporal information through multi-frame learning and optical-flow-based feature warping, coupled with a temporal consistency loss to reduce flicker. It is evaluated on VF-Bench, the first comprehensive four-task video fusion benchmark spanning multi-exposure, multi-focus, infrared-visible, and medical fusion with a unified spatial-temporal evaluation protocol. Across all tasks, UniVF achieves state-of-the-art results, validating the effectiveness of joint spatial-temporal modeling and flow-guided alignment. The VF-Bench dataset and evaluation suite provide a robust foundation for future research in temporally coherent video fusion and cross-task benchmarking.

Abstract

The real world is dynamic, yet most image fusion methods process static frames independently, ignoring temporal correlations in videos and producing flicker and temporal inconsistency. To address this, we propose Unified Video Fusion (UniVF), a unified framework that leverages multi-frame learning and optical-flow-based feature warping to produce informative, temporally coherent fused videos. To support its development, we also introduce Video Fusion Benchmark (VF-Bench), the first comprehensive benchmark covering four video fusion tasks: multi-exposure, multi-focus, infrared-visible, and medical fusion. VF-Bench provides high-quality, well-aligned video pairs obtained through synthetic data generation and rigorous curation from existing datasets, together with a unified evaluation protocol that jointly assesses the spatial quality and temporal consistency of the fused video. Extensive experiments show that UniVF achieves state-of-the-art results across all tasks on VF-Bench. Project page: https://vfbench.github.io.

Paper Structure

This paper contains 19 sections, 24 equations, 18 figures, and 5 tables.

Figures (18)

  • Figure 1: Overview of our main contribution in this paper.
  • Figure 2: Detailed illustration of our UniVF architecture.
  • Figure 3: The proposed data generation paradigms for (a) multi-exposure video pairs and (b) multi-focus video pairs in VF-Bench.
  • Figure 4: Previous, current, and next frames with their corresponding validity masks $M^t_{\text{prev}}$ and $M^t_{\text{next}}$. Black regions denote invalid or unreliable areas, corresponding to poorly aligned or occluded pixels that are excluded from the temporal consistency computation.
  • Figure 5: Qualitative comparison of fusion outcomes for multi-exposure video fusion.
  • ...and 13 more figures
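
The Figure 4 caption says that pixels marked invalid by $M^t_{\text{prev}}$ and $M^t_{\text{next}}$ are excluded from the temporal consistency computation. The sketch below illustrates that idea only and is not the paper's loss: it assumes the neighboring frames have already been warped to the current frame, uses a plain L1 difference, and represents frames as nested lists of floats. The function name `masked_temporal_l1` and the `eps` parameter are hypothetical.

```python
def masked_temporal_l1(cur, neighbor, mask, eps=1e-8):
    """Mean absolute difference between two frames over valid pixels only.

    cur, neighbor: 2D lists of floats with the same shape. `neighbor` is
                   assumed to be already warped to the current frame.
    mask:          2D list with 1.0 for valid pixels and 0.0 for occluded
                   or poorly aligned pixels.
    """
    num = 0.0  # sum of absolute differences over valid pixels
    den = 0.0  # number of valid pixels
    for row_c, row_n, row_m in zip(cur, neighbor, mask):
        for c, n, m in zip(row_c, row_n, row_m):
            num += m * abs(c - n)
            den += m
    return num / (den + eps)


# Toy 4x4 example: the warped previous frame disagrees at one pixel.
cur = [[0.0] * 4 for _ in range(4)]
warped_prev = [[0.0] * 4 for _ in range(4)]
warped_prev[0][0] = 1.0

mask_all = [[1.0] * 4 for _ in range(4)]   # every pixel counted
mask_excl = [[1.0] * 4 for _ in range(4)]
mask_excl[0][0] = 0.0                      # mismatched pixel marked invalid

loss_all = masked_temporal_l1(cur, warped_prev, mask_all)
loss_excl = masked_temporal_l1(cur, warped_prev, mask_excl)
```

With every pixel counted, the single mismatch contributes to the average. Once that pixel is masked out, the term drops to zero, which is the behavior the black regions in the validity masks are meant to produce. A two-sided version would average this term over the previous and next frames.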