Video Decomposition Prior: A Methodology to Decompose Videos into Layers

Gaurav Shrivastava, Ser-Nam Lim, Abhinav Shrivastava

TL;DR

The paper tackles the data-hungry nature of video enhancement tasks by introducing Video Decomposition Prior (VDP), an inference-time optimization that operates on a single video using FlowRGB cues to decompose scenes into multiple RGB layers with opacities. It formulates two modular networks, RGBnet and α-net, to model appearance and motion-based transmission maps, and optimizes a unified loss with reconstruction and temporal-warp terms, enabling tasks such as video relighting, dehazing, and unsupervised video object segmentation without external training data. A novel logarithmic decomposition for relighting, robust UVOS performance, and state-of-the-art results on dehazing/relighting demonstrate the framework’s effectiveness, while coherent edit propagation shows practical utility for video edits. The approach leverages test-time optimization, motion-aware decomposition, and FlowRGB guidance to achieve high-quality, generalizable results, with clear limitations around optical-flow accuracy, layer count, and computational cost.

Abstract

In the evolving landscape of video enhancement and editing methodologies, a majority of deep learning techniques rely on extensive datasets of observed input and ground truth sequence pairs for optimal performance. Such reliance often falters when acquiring data becomes challenging, especially in tasks like video dehazing and relighting, where replicating identical motions and camera angles in both corrupted and ground truth sequences is complicated. Moreover, these conventional methodologies perform best when the test distribution closely mirrors the training distribution. Recognizing these challenges, this paper introduces a novel video decomposition prior (VDP) framework, which derives inspiration from professional video editing practices. Our methodology does not mandate collecting a task-specific external data corpus; instead, it pivots to utilizing the motion and appearance of the input video. The VDP framework decomposes a video sequence into a set of multiple RGB layers and associated opacity levels. These layers are then manipulated individually to obtain the desired results. We address tasks such as video object segmentation, dehazing, and relighting. Moreover, we introduce a novel logarithmic video decomposition formulation for video relighting tasks, setting a new benchmark over the existing methodologies. We observe the property of relighting emerge as we optimize for our novel relighting decomposition formulation. We evaluate our approach on standard video datasets like DAVIS, REVIDE, and SDSD and show qualitative results on a diverse array of internet videos. Video results are available on the project page: https://www.cs.umd.edu/~gauravsh/video_decomposition/index.html

Paper Structure

This paper contains 27 sections, 19 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Visual representation of video edits obtained using VDP. The first row demonstrates a foreground manipulation example that leverages an object mask as the decomposition guide. This object mask is obtained by performing Unsupervised Video Object Segmentation (UVOS), which is a downstream task of video decomposition and is achieved using our proposed framework. Our approach effectively separates the foreground objects and background in the video sequence, enabling us to perform object manipulation. The second row shows the result of our approach for video dehazing, where our method effectively removes the haze from the scene. Finally, the third row showcases the effectiveness of our approach for video relighting, where our method effectively changes the lighting of the scene. Our proposed approach outperforms the state-of-the-art methods for the latter two tasks, highlighting the efficacy of our framework for video decomposition.
  • Figure 2: VDP for relighting a video sequence. In this pipeline, the input video frame $t$ is fed into a shallow U-Net denoted by $f^{(1)}_\text{RGB}$, while the flow-RGB is given as input to a separate shallow U-Net denoted by $f^{(1)}_\alpha$. The intermediate output of $f^{(1)}_\text{RGB}$ is a re-lit version of the input frame $t$, while the output of $f^{(1)}_\alpha$ is the transmission map (Tmap), denoted by $1/A_t$. It is important to note that $\gamma^{-1}$ is also treated as a trainable parameter in the pipeline. After obtaining the reconstructed frame $t$, we apply reconstruction and warping losses to optimize $f^{(1)}_\text{RGB}$, $f^{(1)}_\alpha$, and $\gamma^{-1}$.
  • Figure 3: Qualitative evaluation on Video Relighting benchmark: We compare the re-lit result of our method with the baselines on the SDSD (wang2021seeing) dataset. We compare our method against both image and video baselines. ZeroDCE++ (Zero-DCE++) is an image-based baseline, while SDSD (wang2021seeing) and Stablellve (zhang2021learning) are video baselines. Please note that we have utilized the pretrained models of all the baselines to obtain qualitative results. SDSD and Stablellve require training on an external dataset, while our approach operates directly on the low-light video sequence.
  • Figure 4: Our Framework: In this figure, we present the pipeline for the task of decomposing the video into two different components. In this configuration, the input video frame $t$ is fed into two different shallow U-Nets denoted by $f^{(1)}_\text{RGB}$ and $f^{(2)}_\text{RGB}$. We obtain two different intermediate components of the input frame $t$ in the form of layer 1 and layer 2. Additionally, we process forward flow-RGBs using a separate shallow U-Net denoted by $f^{(1)}_\alpha$. This network outputs $\alpha$ compositing maps, which are then used to blend the two layers and reconstruct the original input frame $t$. After the compositing, we apply various losses (reconstruction loss and regularization loss) depending on the task at hand. (Right) Illustration of Flow Similarity loss: We define this loss over the flow-RGB image ($F^{RGB}$). This loss ensures that pixels with similar motions are grouped together. Illustration of Reconstruction layer loss: We define this loss over the masked object layers and the masked input frame $X_t$. This loss ensures that a masked object layer such as $M_tf^{(1)}_\text{RGB}$ has the same appearance as the masked input frame $M_tX_t$.
  • Figure 5: Qualitative evaluation on VOS benchmark: We compare the mask from our method with the baselines on the DAVIS-16 (Perazzi2016) dataset. We compare our method against the baselines DS (ye2022deformable) and DyStab (yang2021dystab). Please note that the baseline masks above (except for DoubleDIP) are pre-computed masks provided by the authors of the respective papers.
  • ...and 6 more figures
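
The two-layer compositing step described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard per-pixel alpha blend `alpha * layer1 + (1 - alpha) * layer2` and a mean-absolute-error reconstruction term, neither of which is spelled out in the text above. The function names `composite_two_layers` and `reconstruction_loss` are hypothetical, and random arrays stand in for the outputs of $f^{(1)}_\text{RGB}$, $f^{(2)}_\text{RGB}$, and $f^{(1)}_\alpha$.

```python
import numpy as np


def composite_two_layers(layer1, layer2, alpha):
    """Blend two RGB layers with a per-pixel opacity map.

    layer1, layer2: float arrays of shape (H, W, 3) with values in [0, 1].
    alpha: float array of shape (H, W) in [0, 1]; 1 selects layer1, 0 selects layer2.
    Assumption: standard alpha blending; the paper's exact formula may differ.
    """
    a = alpha[..., None]  # add a channel axis so alpha broadcasts over RGB
    return a * layer1 + (1.0 - a) * layer2


def reconstruction_loss(frame, recon):
    """Mean absolute error between a frame and its reconstruction (assumed L1 form)."""
    return float(np.mean(np.abs(frame - recon)))


# Stand-ins for network outputs on a tiny 4x6 frame.
rng = np.random.default_rng(0)
h, w = 4, 6
layer1 = rng.random((h, w, 3))
layer2 = rng.random((h, w, 3))

# A hard mask: the left half of the frame comes from layer 1, the right half from layer 2.
alpha = np.zeros((h, w))
alpha[:, :3] = 1.0

frame = composite_two_layers(layer1, layer2, alpha)
loss = reconstruction_loss(frame, frame)  # a perfect reconstruction gives zero loss
```

In a test-time optimization setting, the loss would be minimized with respect to the network parameters that produce the layers and the opacity map, so that the composite matches the observed frame.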