COIN: Control-Inpainting Diffusion Prior for Human and Camera Motion Estimation

Jiefeng Li, Ye Yuan, Davis Rempe, Haotian Zhang, Pavlo Molchanov, Cewu Lu, Jan Kautz, Umar Iqbal

TL;DR

COIN introduces a control-inpainting score distillation sampling (SDS) method that keeps the motion drawn from a diffusion prior well-aligned, consistent, and high-quality within a joint optimization framework. It outperforms state-of-the-art methods on both global human motion estimation and camera motion estimation.

Abstract

Estimating global human motion from moving cameras is challenging because human and camera motions are entangled. To mitigate this ambiguity, existing methods leverage learned human motion priors, which, however, often produce oversmoothed motions with misaligned 2D projections. To tackle this problem, we propose COIN, a control-inpainting motion diffusion prior that enables fine-grained control to disentangle human and camera motions. Although pre-trained motion diffusion models encode rich motion priors, we find it non-trivial to leverage such knowledge to guide global motion estimation from RGB videos. COIN introduces a novel control-inpainting score distillation sampling method to ensure well-aligned, consistent, and high-quality motion from the diffusion prior within a joint optimization framework. Furthermore, we introduce a new human-scene relation loss that alleviates the scale ambiguity by enforcing consistency among the humans, camera, and scene. Experiments on three challenging benchmarks demonstrate the effectiveness of COIN, which outperforms state-of-the-art methods in both global human motion estimation and camera motion estimation. As an illustrative example, COIN outperforms the state-of-the-art method by 33% in world joint position error (W-MPJPE) on the RICH dataset.
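To make the reported metric concrete, the following is a minimal sketch of a mean per-joint position error computed on joints expressed in world coordinates, together with the relative reduction that a phrase like "by 33%" typically denotes. It is an illustration under stated assumptions, not the paper's evaluation code: the function names `mpjpe` and `relative_reduction` are hypothetical, the trajectory alignment that benchmark protocols apply before measuring W-MPJPE is omitted, and all numbers are made up for the example.

```python
import math


def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding joints.

    pred, gt: sequences of frames; each frame is a sequence of (x, y, z)
    joints, both given in the same world frame and the same unit.
    Assumption: any alignment required by a benchmark was already applied.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of frames")
    total, count = 0.0, 0
    for frame_pred, frame_gt in zip(pred, gt):
        if len(frame_pred) != len(frame_gt):
            raise ValueError("frames must have the same number of joints")
        for joint_pred, joint_gt in zip(frame_pred, frame_gt):
            total += math.dist(joint_pred, joint_gt)
            count += 1
    if count == 0:
        raise ValueError("at least one joint is required")
    return total / count


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Toy data (hypothetical): 2 frames, 2 joints each.
gt = [[(0, 0, 0), (0, 1, 0)], [(1, 0, 0), (1, 1, 0)]]
pred = [[(0, 0, 0), (0, 1, 0)], [(1, 0, 3), (1, 1, 4)]]

error = mpjpe(pred, gt)  # per-joint distances are 0, 0, 3, 4
print(error)
print(relative_reduction(150.0, 100.5))  # made-up error values
```

Because the error is averaged over every joint in every frame, drift in the global trajectory affects all joints of the affected frames, which is why world-frame errors are sensitive to how well human and camera motion are disentangled.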

Paper Structure (31 sections, 13 equations, 4 figures, 7 tables, 1 algorithm)

Figures (4)

  • Figure 1: Overview. Given a video with a moving camera, we recover the global human motion $\mathbf{H}$ and camera motion $\mathcal{C}$ using an iterative optimization framework. We propose a novel Control-Inpainting SDS loss ($\mathcal{L}_\textrm{COIN-SDS}$) to leverage motion diffusion models as a prior. COIN-SDS is designed such that the sampled motions from the motion prior are consistent with video observations. We achieve this by controlling and constraining the sampling process of the motion diffusion model through novel control and soft-inpainting branches. We also propose a novel human-scene relation loss ($\mathcal{L}_\textrm{HSR}$) to encourage consistency among the human motion, camera motion, and scene features.
  • Figure 2: Qualitative comparisons with state-of-the-art methods. PACE [kocabas2024pace] fails to recover a correct trajectory (left). WHAM [shin2024wham] estimates the wrong walking direction of the person (right). Our approach, COIN, recovers the human and camera motion accurately in both scenarios.
  • Figure 3: Architecture of the controlled denoiser.
  • Figure 4: Error distributions on the EMDB dataset.
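
For orientation, the Figure 1 caption builds on score distillation sampling (SDS). The generic SDS gradient, as commonly written in the text-to-3D literature, is reproduced below as background only. It is not the COIN-SDS loss, whose control and soft-inpainting branches are not specified in this summary. The symbols are assumptions for illustration: $\theta$ denotes the optimized variables, $\mathbf{x} = g(\theta)$ the sample they produce, $\hat{\boldsymbol{\epsilon}}_\phi$ a pre-trained denoiser with conditioning $y$, $w(t)$ a timestep weighting, and $\alpha_t, \sigma_t$ the noise schedule.

$$
\mathbf{x}_t = \alpha_t \mathbf{x} + \sigma_t \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})
$$

$$
\nabla_\theta \mathcal{L}_\textrm{SDS} = \mathbb{E}_{t, \boldsymbol{\epsilon}} \left[ w(t) \left( \hat{\boldsymbol{\epsilon}}_\phi(\mathbf{x}_t; y, t) - \boldsymbol{\epsilon} \right) \frac{\partial \mathbf{x}}{\partial \theta} \right]
$$

Read this way, the diffusion model acts as a critic: the difference between predicted and injected noise pushes $\mathbf{x}$ toward samples the prior considers likely, without backpropagating through the denoiser itself.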