FMPose3D: monocular 3D pose estimation via flow matching

Ti Wang, Xiaohang Yu, Mackenzie Weygandt Mathis

TL;DR

FMPose3D tackles monocular 3D pose estimation by casting it as conditional distribution transport, learning a velocity field $v_\theta$ that deterministically evolves samples from $p_0=\mathcal{N}(0,I)$ toward the conditional 3D pose distribution given a 2D input. The model leverages an ODE, $\frac{dx_t}{dt}=v_\theta(x_t,t,c)$, to generate multiple plausible poses from different noise seeds in just a few steps, addressing depth ambiguity and occlusion. To produce a robust single prediction, it introduces Reprojection-based Posterior Expectation Aggregation (RPEA), which weights pose hypotheses by a pseudo-likelihood based on 2D reprojection loss and computes a joint- or pose-wise MMSE-like estimate. Across Human3.6M, MPI-INF-3DHP, Animal3D, CtrlAni3D, and 3DPW, FMPose3D achieves competitive or state-of-the-art performance with significantly faster inference than diffusion-based methods, and its multi-hypothesis framework yields reliable uncertainty estimates. Code is publicly available, enabling practical deployment in real-time or resource-constrained settings.
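As a rough illustration of the sampling idea above, the sketch below integrates $\frac{dx_t}{dt}=v_\theta(x_t,t,c)$ with a few forward Euler steps, starting from several Gaussian noise seeds. It is not the authors' implementation: the Euler solver, the function names, and `toy_velocity` (a hand-written stand-in for the trained network $v_\theta$) are all assumptions made for illustration.

```python
import numpy as np

def toy_velocity(x, t, c):
    """Hypothetical stand-in for the learned v_theta(x_t, t, c).

    Pulls each sample toward a crude 3D lift of the 2D pose c (depth set to 0).
    A real model would be a trained network; this is illustration only.
    """
    lifted = np.concatenate([c, np.zeros_like(c[..., :1])], axis=-1)  # (J, 3)
    return lifted - x  # broadcasts over the hypothesis axis

def sample_hypotheses(velocity_fn, c, num_hypotheses=8, num_steps=3, seed=0):
    """Integrate dx/dt = v(x, t, c) from t=0 to t=1 with forward Euler."""
    rng = np.random.default_rng(seed)
    num_joints = c.shape[0]
    # One Gaussian noise sample per hypothesis: x_0 ~ N(0, I).
    x = rng.standard_normal((num_hypotheses, num_joints, 3))
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t, c)  # deterministic update given x_0
    return x  # (num_hypotheses, num_joints, 3)

# Made-up 3-joint 2D pose, used only to exercise the sketch.
pose_2d = np.array([[0.0, 0.0], [0.1, 0.5], [-0.1, 0.5]])
hyps = sample_hypotheses(toy_velocity, pose_2d)
```

Each trajectory is deterministic once its starting noise is fixed, so the diversity among hypotheses comes only from the different draws of $x_0$.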

Abstract

Monocular 3D pose estimation is fundamentally ill-posed due to depth ambiguity and occlusions, thereby motivating probabilistic methods that generate multiple plausible 3D pose hypotheses. In particular, diffusion-based models have recently demonstrated strong performance, but their iterative denoising process typically requires many timesteps for each prediction, making inference computationally expensive. In contrast, we leverage Flow Matching (FM) to learn a velocity field defined by an Ordinary Differential Equation (ODE), enabling efficient generation of 3D pose samples with only a few integration steps. We propose a novel generative pose estimation framework, FMPose3D, that formulates 3D pose estimation as a conditional distribution transport problem. It continuously transports samples from a standard Gaussian prior to the distribution of plausible 3D poses conditioned only on 2D inputs. Although ODE trajectories are deterministic, FMPose3D naturally generates various pose hypotheses by sampling different noise seeds. To obtain a single accurate prediction from those hypotheses, we further introduce a Reprojection-based Posterior Expectation Aggregation (RPEA) module, which approximates the Bayesian posterior expectation over 3D hypotheses. FMPose3D surpasses existing methods on the widely used human pose estimation benchmarks Human3.6M and MPI-INF-3DHP, and further achieves state-of-the-art performance on the 3D animal pose datasets Animal3D and CtrlAni3D, demonstrating strong performance across both 3D pose domains. The code is available at https://github.com/AdaptiveMotorControlLab/FMPose3D.
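The aggregation step can be pictured with a minimal sketch: each hypothesis is reprojected to 2D, scored against the input 2D pose, and the hypotheses are averaged with weights that shrink as the reprojection error grows. The exact weighting used by RPEA is not given here, so the `exp(-alpha * error)` form, the temperature `alpha`, the orthographic projection (dropping z), and the function name are assumptions, not the paper's definition.

```python
import numpy as np

def aggregate_by_reprojection(hyps, pose_2d, alpha=10.0, joint_wise=True):
    """Weighted mean of 3D hypotheses; weights decay with 2D reprojection error.

    hyps: (H, J, 3) pose hypotheses, pose_2d: (J, 2) input 2D pose.
    Orthographic projection and the exponential weighting are assumptions
    made for this sketch.
    """
    reproj = hyps[..., :2]                                # (H, J, 2)
    err = np.linalg.norm(reproj - pose_2d, axis=-1)       # (H, J)
    if not joint_wise:
        # Pose-wise variant: a single weight per hypothesis.
        err = err.mean(axis=1, keepdims=True)             # (H, 1)
    logits = -alpha * err
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=0, keepdims=True)                  # normalize over hypotheses
    return (w[..., None] * hyps).sum(axis=0)              # (J, 3)
```

With `alpha=0` every hypothesis gets equal weight and the result is the plain mean; larger values concentrate the estimate on the hypotheses that best explain the 2D observation.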

Paper Structure (25 sections, 13 equations, 12 figures, 8 tables, 1 algorithm)

Figures (12)

  • Figure 1: Overview of the training process. The process starts from a noise sample $x_0 \!\sim\! \mathcal{N}(0, I)$ and a ground-truth 3D pose $x_1$ from the training set. An intermediate sample $x_t$ is obtained by linear interpolation between $x_0$ and $x_1$. The red region illustrates the valid 3D pose data manifold. The network $v_\theta(x_t, t, c)$, conditioned on the 2D pose $c = x^{2D}$, is trained to predict the true velocity $v_t$. The Flow Matching loss $\mathcal{L}_{\text{CFM}} = \|v_\theta - v_t\|_2^2$ minimizes the discrepancy between the predicted and ground-truth velocities.
  • Figure 2: Illustration of multi-hypothesis generation and aggregation during inference.
  • Figure 3: Comparison of different aggregation strategies on the Human3.6M test set. The top plot reports MPJPE, while the bottom plot shows P-MPJPE.
  • Figure 4: Qualitative comparison of DiffPose [gong2023diffpose] and FMPose3D on Human3.6M. The blue pose represents the predicted results, while the red pose represents the ground truth.
  • Figure 5: Qualitative comparison of AniMer [lyu2025animer] and FMPose3D on Animal3D (left column) and CtrlAni3D (right column).
  • ...and 7 more figures
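
The training step described in the Figure 1 caption can be sketched as follows. For a linear interpolation $x_t = (1-t)\,x_0 + t\,x_1$, the time derivative is $x_1 - x_0$, which is taken here as the target velocity $v_t$. Sampling $t$ uniformly, the batch layout, and the function names are assumptions for illustration; the network $v_\theta$ itself is left out.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build (x_t, t, v_t) for a batch of ground-truth 3D poses x1: (B, J, 3)."""
    x0 = rng.standard_normal(x1.shape)           # x_0 ~ N(0, I)
    # One time value per sample; uniform sampling of t is an assumption.
    t = rng.uniform(size=(x1.shape[0], 1, 1))
    x_t = (1.0 - t) * x0 + t * x1                # linear interpolation
    v_t = x1 - x0                                # d x_t / dt along the linear path
    return x_t, t, v_t

def cfm_loss(v_pred, v_t):
    """Squared L2 distance between predicted and target velocity, batch-averaged."""
    return np.mean(np.sum((v_pred - v_t) ** 2, axis=(1, 2)))
```

In a real training loop, `v_pred` would come from the network evaluated at `(x_t, t, c)` with the 2D pose `c` as conditioning, and the loss would be minimized with a gradient-based optimizer.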