
POBEVM: Real-time Video Matting via Progressively Optimize the Target Body and Edge

Jianming Xian

TL;DR

POBEVM tackles real-time trimap-free video matting by separating optimization of target body and edge via the SOBE block. The network uses an encoder–decoder with attention-guided SOBE blocks and an optional Deep Guided Filter, plus Edge-L1-Loss to strengthen edge predictions. Evaluations on VM and D646 datasets show state-of-the-art edge and overall matting performance among trimap-free methods; segmentation experiments demonstrate SOBE's generality to refine edges in camouflaged-object segmentation. The method reduces reliance on manual trimaps while achieving sharper edges suitable for downstream video editing.

Abstract

Approaches based on deep convolutional neural networks (CNNs) have achieved strong performance in video matting. Many of these methods produce accurate alpha estimates for the target body but typically yield fuzzy or incorrect target edges. This usually has two causes: 1) current methods treat the target body and edge indiscriminately; 2) the target body dominates the whole target, with the edge accounting for only a tiny proportion. For the first problem, we propose a CNN-based module that separately optimizes the matting target body and edge (SOBE). On this basis, we introduce a real-time, trimap-free video matting method via progressively optimizing the matting target body and edge (POBEVM) that is much lighter than previous approaches and achieves significant improvements in the predicted target edge. For the second problem, we propose an Edge-L1-Loss (ELL) function that focuses our network on the matting target edge. Experiments demonstrate that our method outperforms prior trimap-free matting methods on both the Distinctions-646 (D646) and VideoMatte240K (VM) datasets, especially in edge optimization.
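To make the idea of an edge-focused L1 term concrete, here is a minimal sketch in plain Python. The abstract does not give the ELL formula, so everything below is an assumption made for illustration: the edge region is taken to be the pixels whose ground-truth alpha lies strictly between 0 and 1, and the names `edge_mask`, `edge_l1_loss`, `combined_loss`, `eps`, and `edge_weight` are hypothetical, not taken from the paper.

```python
from typing import List

Matte = List[List[float]]  # alpha values in [0, 1], stored row by row


def edge_mask(alpha_gt: Matte, eps: float = 1e-3) -> List[List[bool]]:
    """Assumed edge definition: ground-truth alpha strictly between 0 and 1."""
    return [[eps < a < 1.0 - eps for a in row] for row in alpha_gt]


def l1_loss(pred: Matte, gt: Matte) -> float:
    """Plain L1 loss: mean absolute error over every pixel."""
    diffs = [abs(p - g) for pr, gr in zip(pred, gt) for p, g in zip(pr, gr)]
    return sum(diffs) / len(diffs)


def edge_l1_loss(pred: Matte, gt: Matte, eps: float = 1e-3) -> float:
    """L1 error averaged over the assumed edge pixels only."""
    mask = edge_mask(gt, eps)
    total, count = 0.0, 0
    for pr, gr, mr in zip(pred, gt, mask):
        for p, g, is_edge in zip(pr, gr, mr):
            if is_edge:
                total += abs(p - g)
                count += 1
    # With no edge pixels there is nothing to penalize.
    return total / count if count else 0.0


def combined_loss(pred: Matte, gt: Matte, edge_weight: float = 1.0) -> float:
    """Whole-image L1 plus a weighted edge term (the weighting is hypothetical)."""
    return l1_loss(pred, gt) + edge_weight * edge_l1_loss(pred, gt)


gt = [[0.0, 0.5, 1.0], [0.0, 0.5, 1.0]]
pred = [[0.0, 0.3, 1.0], [0.1, 0.5, 1.0]]
print(l1_loss(pred, gt), edge_l1_loss(pred, gt))
```

Because the edge term averages only over edge pixels, an error on the thin transition band is not diluted by the much larger body region, which is the imbalance the abstract describes as the second problem.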

Paper Structure (11 sections, 5 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Detailed implementation of the proposed POBEVM network structure. There are five alpha predictions and one foreground prediction. When the DGF module is used, its outputs replace the outputs of the Out-blk.
  • Figure 2: Left: the SOBE block. Right: the output block (Out-blk). Both the PC block and the FC block are convolutional layers with the same parameters except for the number of output channels. The UP-Features are one of the inputs to the DGF for generating high-resolution output.
  • Figure 3: Visualization of alpha matte predictions from BGMv2, RVM, VMFormer, and POBEVM (ours). Our method produces more detailed alpha mattes than the others.