MambaVF: State Space Model for Efficient Video Fusion

Zixiang Zhao, Yukun Cui, Lilun Deng, Haowen Bai, Haotong Qin, Tao Feng, Konrad Schindler

TL;DR

The paper tackles the computational bottlenecks of flow-based video fusion by introducing MambaVF, a flow-free approach built on State Space Models (SSMs) that models temporal evolution with linear complexity $O(T)$. It couples an 8-way spatio-temporal bidirectional (STB) scanning mechanism with a dual-stream Tri-Axis Mamba Encoder to enable cross-modal fusion without explicit motion estimation. Across VF-Bench tasks for MEF, MFF, IVF, and MVF, MambaVF achieves state-of-the-art results while markedly reducing parameters (to 7.75% of a flow-based baseline) and FLOPs (to 11.21%), delivering a 2.1× speedup. Ablation studies validate the STB design, temporal context, and decoder choices, highlighting the method's efficiency and robustness. The work offers a practical path to real-time, edge-friendly multi-source video fusion across diverse modalities.

Abstract

Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods rely heavily on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF proposes a lightweight SSM-based fusion module that replaces conventional flow-guided alignment with a spatio-temporal bidirectional scanning mechanism. This module enables efficient information aggregation across frames. Extensive experiments across multiple benchmarks demonstrate that MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. MambaVF is also highly efficient, cutting up to 92.25% of parameters and 88.79% of computational FLOPs while delivering a 2.1× speedup compared to existing methods. Project page: https://mambavf.github.io
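
To make the "sequential state update" idea concrete, here is a minimal sketch of a linear state-space recurrence scanned forward and backward over a sequence. It is an illustration under simplifying assumptions, not the MambaVF module: the state is a single scalar, the coefficients `a`, `b`, `c` are fixed (Mamba-style models make them input-dependent), and the function names `ssm_scan` and `bidirectional_scan` are hypothetical. The point is only that each direction is a single pass over the T inputs, so the cost grows linearly with T.

```python
from typing import List, Sequence


def ssm_scan(xs: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Scalar linear state update: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t.

    Assumption: fixed scalar coefficients, chosen purely for illustration.
    """
    h = 0.0
    ys: List[float] = []
    for x in xs:  # one pass over T steps: O(T) time, O(1) carried state
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs: Sequence[float], a: float, b: float, c: float) -> List[float]:
    """Sum of a forward scan and a backward scan over the same sequence."""
    fwd = ssm_scan(xs, a, b, c)
    # Scan the reversed sequence, then flip the outputs back into original order.
    bwd = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(ssm_scan(impulse, 0.5, 1.0, 1.0))            # [1.0, 0.5, 0.25, 0.125]
    print(bidirectional_scan(impulse, 0.5, 1.0, 1.0))  # [2.0, 0.5, 0.25, 0.125]
```

With an impulse at the first step, the forward scan shows how the state carries information to later steps with a decaying weight, while the backward scan contributes only at the first position, since nothing nonzero lies after it.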

Paper Structure (13 sections, 5 equations, 12 figures, 5 tables)

This paper contains 13 sections, 5 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Compared with UniVF [zhao2025unified], our MambaVF not only attains state-of-the-art performance on VF-Bench [zhao2025unified], but also requires only 7.75% of the parameters and 11.21% of the FLOPs, while achieving a 2.1× speedup.
  • Figure 2: Parameter, FLOPs, and runtime contribution of optical flow and feature warping modules in UniVF [zhao2025unified].
  • Figure 3: An overview of the proposed MambaVF architecture.
  • Figure 4: Visual comparison of fused results on multi-exposure video fusion.
  • Figure 5: Visual comparison of fused results on multi-focus video fusion.
  • ...and 7 more figures
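
As a quick consistency check, the efficiency figures in the Figure 1 caption and in the Abstract are two views of the same numbers: the caption reports what remains relative to UniVF, and the Abstract reports what is removed. The last term is an assumption, namely that "speedup" means the ratio of baseline runtime to MambaVF runtime.

$$
100\% - 92.25\% = 7.75\%, \qquad 100\% - 88.79\% = 11.21\%, \qquad \frac{1}{2.1} \approx 0.476
$$

Under that reading, a 2.1× speedup corresponds to roughly 47.6% of the baseline runtime.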