Gaussian See, Gaussian Do: Semantic 3D Motion Transfer from Multiview Video

Yarin Bekor, Gal Michael Harari, Or Perel, Or Litany

TL;DR

Gaussian See, Gaussian Do performs semantic 3D motion transfer from multiview video to static 3DGS targets, enabling cross-category, rig-free animation by bridging 2D diffusion priors with dynamic 3D rendering. The approach combines anchor-based, view-aware motion embeddings learned via condition inversion, a three-stage pipeline (Structured Inversion, View-aware Transfer, 4D Consolidation), and a robust, regularized 4D reconstruction pipeline. It introduces the first benchmark for semantic 3D motion transfer and demonstrates superior motion fidelity and structural consistency over adapted baselines, supported by in-the-wild results and a human preference study. By enabling semantically meaningful motion transfer for arbitrary 3D assets, this work broadens practical 3D animation from video data and connects diffusion-based motion priors with cross-category 3D synthesis.

Abstract

We present Gaussian See, Gaussian Do, a novel approach for semantic 3D motion transfer from multiview video. Our method enables rig-free, cross-category motion transfer between objects with semantically meaningful correspondence. Building on implicit motion transfer techniques, we extract motion embeddings from source videos via condition inversion, apply them to rendered frames of static target shapes, and use the resulting videos to supervise dynamic 3D Gaussian Splatting reconstruction. Our approach introduces an anchor-based view-aware motion embedding mechanism, ensuring cross-view consistency and accelerating convergence, along with a robust 4D reconstruction pipeline that consolidates noisy supervision videos. We establish the first benchmark for semantic 3D motion transfer and demonstrate superior motion fidelity and structural consistency compared to adapted baselines. Code and data for this paper are available at https://gsgd-motiontransfer.github.io/

Paper Structure

This paper contains 50 sections, 10 equations, 17 figures, 3 tables.

Figures (17)

  • Figure 1: Pipeline Overview. (1) Structured Multiview Motion Inversion. We extract motion embeddings from the source using slerp interpolation between the two nearest anchor points. (2) View-aware Semantic Motion Transfer. We use the motion embeddings to generate supervision for the motion transfer process. (3) 4D Consolidation. We apply the supervision to the target shape to introduce actual motion.
  • Figure 2: Qualitative comparison of semantic 3D motion transfer. We compare our method to adapted baselines, showing two views of the source motion, the target 3DGS object, and the generated output for each method. We demonstrate superior identity preservation while also accurately transferring the source motion to the target.
  • Figure 3: In-the-wild Motion Transfer. Our method animates 3D assets reconstructed from real-world imagery, demonstrating robust semantic motion transfer in real scenes using 3DGS. The ‘Scanned scene’ column visualizes a sparse sample of the 3DGS scene produced by our reconstruction stage, highlighting that the motion is applied directly to the scene.
  • Figure 4: Human preference study. Left: Mean subjective ratings for motion plausibility and appearance fidelity. Our method is the only one to preserve target identity while delivering high-quality motion transfer. Right: Preference results from an ablation study, showing that both LPIPS and ARAP rotation substantially improve perceptual quality.
  • Figure 5: Qualitative comparison of novel-view motion synthesis. Interpolating simple motion embeddings (Eq. \ref{eq:simple_embeddings}) and a single global embedding both fail to generalize motion embeddings to views unseen during optimization. In contrast, our anchor-based mechanism successfully recovers faithful motion. Similar behavior is observed for the target object before and after reconstruction (Appendix \ref{app:novel_motion_synthesis}).
  • ...and 12 more figures
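
The Figure 1 caption mentions obtaining a view's motion embedding by slerp interpolation between the two nearest anchor points. The following is a minimal sketch of that idea only, not the authors' implementation. It assumes, hypothetically, that anchors are indexed by camera azimuth in degrees, that each anchor embedding is a flat vector, and that the "two nearest" anchors are the two that bracket the query azimuth on the circle. The names `slerp` and `view_embedding` are illustrative.

```python
import numpy as np


def slerp(a, b, t, eps=1e-6):
    """Spherical linear interpolation between vectors a and b at fraction t in [0, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cos_omega = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        # Nearly parallel (or anti-parallel) vectors: fall back to linear interpolation.
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / sin_omega


def view_embedding(azimuth_deg, anchor_azimuths, anchor_embeddings):
    """Embedding for a query view, interpolated from the two bracketing anchors.

    Assumption (not from the paper): anchors are parameterized by azimuth only,
    and anchor_embeddings has shape (num_anchors, dim).
    """
    az = azimuth_deg % 360.0
    order = np.argsort(anchor_azimuths)
    angles = np.asarray(anchor_azimuths, dtype=float)[order]
    embs = np.asarray(anchor_embeddings, dtype=float)[order]

    # Index of the first anchor strictly past the query angle, wrapping around 360.
    hi = np.searchsorted(angles, az, side="right") % len(angles)
    lo = (hi - 1) % len(angles)

    # Angular gap between the two anchors and the query's fractional position in it.
    span = (angles[hi] - angles[lo]) % 360.0
    t = ((az - angles[lo]) % 360.0) / span if span > 0 else 0.0
    return slerp(embs[lo], embs[hi], t)


# Usage: four hypothetical anchors placed every 90 degrees, with random embeddings.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
embedding = view_embedding(315.0, [0.0, 90.0, 180.0, 270.0], anchors)
```

One reason to prefer slerp over plain linear interpolation here is that it moves along the arc between the two anchor directions, so the interpolated vector does not shrink toward the origin midway between anchors. Whether the paper normalizes embeddings or interpolates per token is not stated in this excerpt.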