Splatter a Video: Video Gaussian Representation for Versatile Processing

Yang-Tian Sun, Yi-Hua Huang, Lin Ma, Xiaoyang Lyu, Yan-Pei Cao, Xiaojuan Qi

TL;DR

This work introduces a novel explicit 3D representation, the video Gaussian representation, which embeds a video into 3D Gaussians. Explicit Gaussians act as proxies for video appearance, and each Gaussian is associated with a 3D motion that models the video's dynamics.

Abstract

Video representation is a long-standing problem that is crucial for various downstream tasks, such as tracking, depth prediction, segmentation, view synthesis, and editing. However, current methods either struggle to model complex motions due to the absence of 3D structure or rely on implicit 3D representations that are ill-suited for manipulation tasks. To address these challenges, we introduce a novel explicit 3D representation, the video Gaussian representation, which embeds a video into 3D Gaussians. Our proposed representation models video appearance in a 3D canonical space using explicit Gaussians as proxies and associates each Gaussian with 3D motions to model video motion. This approach offers a more intrinsic and explicit representation than layered atlases or volumetric pixel matrices. To obtain such a representation, we distill 2D priors, such as optical flow and depth, from foundation models to regularize learning in this ill-posed setting. Extensive applications demonstrate the versatility of our new video representation. It has proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation. Project page: https://sunyangtian.github.io/spatter_a_video_web/
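To make the idea of "explicit Gaussians with associated 3D motion" concrete, the following is a minimal sketch and not the authors' implementation. The `VideoGaussian` class, the polynomial motion model, the pinhole projection, and all field and function names are assumptions introduced here for illustration; the abstract does not specify the motion parameterization or the camera model.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class VideoGaussian:
    """One explicit Gaussian proxy in a canonical 3D space (hypothetical layout)."""
    center: Vec3               # canonical 3D position
    color: Vec3                # RGB appearance
    motion_coeffs: List[Vec3]  # assumed polynomial motion: k-th entry (1-based) scales t**k

    def center_at(self, t: float) -> Vec3:
        """Canonical center plus a polynomial 3D displacement at normalized time t."""
        x, y, z = self.center
        for k, (cx, cy, cz) in enumerate(self.motion_coeffs, start=1):
            x += cx * t ** k
            y += cy * t ** k
            z += cz * t ** k
        return (x, y, z)


def project(p: Vec3, focal: float = 1.0) -> Tuple[float, float, float]:
    """Assumed pinhole projection in camera coordinates; returns (u, v, depth)."""
    x, y, z = p
    return (focal * x / z, focal * y / z, z)


def induced_flow(g: VideoGaussian, t0: float, t1: float) -> Tuple[float, float]:
    """2D displacement of the projected center between two times.

    A quantity like this could be compared against an optical-flow prior,
    and the projected depth against a depth prior.
    """
    u0, v0, _ = project(g.center_at(t0))
    u1, v1, _ = project(g.center_at(t1))
    return (u1 - u0, v1 - v0)


# Usage: a Gaussian at depth 2 moving linearly along x.
g = VideoGaussian(center=(0.0, 0.0, 2.0), color=(1.0, 0.0, 0.0),
                  motion_coeffs=[(1.0, 0.0, 0.0)])
print(g.center_at(0.5))
print(induced_flow(g, 0.0, 0.5))
```

Because each Gaussian carries its own 3D trajectory, tracking in this sketch reduces to evaluating `center_at` over time, and editing reduces to changing the stored `color` or `motion_coeffs`; this is the sense in which an explicit representation is easier to manipulate than an implicit one.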

Paper Structure (25 sections, 16 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: We propose an approach to convert a video into a Video Gaussian Representation (VGR), which can be conveniently used for a wide range of video processing tasks.
  • Figure 2: Pipeline of our approach. Given a video, we represent its intricate 3D content using video Gaussians in the camera coordinate space. By associating them with motion parameters, we enable video Gaussians to capture the video dynamics. These video Gaussians are supervised by RGB image frames and 2D priors such as optical flow, depth, and label masks. This representation makes it convenient for users to perform various editing tasks on the video.
  • Figure 3: Qualitative comparison of video reconstruction using our method and SOTA methods.
  • Figure 4: Dense tracking results on diverse complex motion patterns.
  • Figure 5: Qualitative comparison of video depth and features generated by our method and SOTA single-frame estimation methods. Our method yields more consistent estimations.
  • ...and 5 more figures
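
The supervision described in the Figure 2 caption (RGB frames plus 2D priors such as optical flow and depth) can be sketched as a weighted sum of per-signal errors. This is an illustrative assumption, not the paper's loss: the function names, the L1 error, and the weights `w_flow` and `w_depth` are hypothetical, the label-mask term is omitted, and a plain L1 on depth ignores the scale ambiguity that monocular depth priors usually have.

```python
from typing import Sequence


def l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between two equally sized sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - q) for p, q in zip(pred, target)) / len(pred)


def total_loss(
    rgb_pred: Sequence[float], rgb_gt: Sequence[float],
    flow_pred: Sequence[float], flow_prior: Sequence[float],
    depth_pred: Sequence[float], depth_prior: Sequence[float],
    w_flow: float = 0.5,   # hypothetical weight for the optical-flow prior
    w_depth: float = 0.5,  # hypothetical weight for the depth prior
) -> float:
    """Photometric term plus weighted flow and depth regularizers."""
    return (
        l1(rgb_pred, rgb_gt)
        + w_flow * l1(flow_pred, flow_prior)
        + w_depth * l1(depth_pred, depth_prior)
    )


# Usage with toy values: the rendered outputs disagree with every target.
loss = total_loss(
    rgb_pred=[1.0, 0.0], rgb_gt=[0.0, 0.0],
    flow_pred=[2.0], flow_prior=[0.0],
    depth_pred=[4.0], depth_prior=[2.0],
)
print(loss)
```

The flow and depth terms act as regularizers: a single video leaves the 3D structure and motion under-constrained, so the priors narrow down which Gaussian configurations are accepted.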