
THFM: A Unified Video Foundation Model for 4D Human Perception and Beyond

Letian Wang, Andrei Zanfir, Eduard Gabriel Bazavan, Misha Andriluka, Cristian Sminchisescu

Abstract

We present THFM, a unified video foundation model for human-centric perception that jointly addresses dense tasks (depth, normals, segmentation, dense pose) and sparse tasks (2d/3d keypoint estimation) within a single architecture. THFM is derived from a pretrained text-to-video diffusion model, repurposed as a single-forward-pass perception model and augmented with learnable tokens for sparse predictions. Modulated by the text prompt, our single unified model is capable of performing a range of perception tasks. Crucially, our model matches or surpasses state-of-the-art specialized models on a variety of benchmarks despite being trained exclusively on synthetic data, i.e., without training on real-world or benchmark-specific data. We further highlight intriguing emergent properties of our model, which we attribute to the underlying diffusion-based video representation. For example, our model, trained on videos with a single human in the scene, generalizes to multiple humans and to other object classes such as anthropomorphic characters and animals, a capability that has not been demonstrated before.

Figures (5)

  • Figure 1: THFM is a single, unified video perception model with SOTA performance across a multitude of output modalities. From left to right we show: a frame from the input video, estimated surface normals, depth, segmentation, dense human semantics [wang2019normalized, Guler2018DensePose], and 2d and 3d keypoints. Our approach has been trained on videos of people generated synthetically, yet it generalizes to real videos, both for people and for other categories such as animals and anthropomorphic characters.
  • Figure 2: Method overview of THFM, a simple yet powerful architecture adapted from text-to-video diffusion models. Given an input video and a text prompt specifying the desired output, our unified model, trained only on synthetic data, performs a wide range of dense and sparse perception tasks in a single forward pass. The dense vision tasks are unified in the RGB ambient space, where supervision can be applied in both the latent space and the RGB ambient space; the sparse vision tasks are realized by adding learnable tokens as additional inputs to the diffusion transformer (DiT). A code sketch of this interface follows the figure list below.
  • Figure 3: 3D Pose Estimation. Example 3D pose estimation results on challenging skiing (top) and snowboarding (bottom) videos. We show every 10th frame of a subsequence of the input video. In contrast to existing state-of-the-art 3D human pose estimation methods, the THFM model takes the full video frame as input and does not require pre-processing steps such as person detection or 2d keypoint estimation.
  • Figure 4: Emergent behavior: sim-to-real generalization to OOD classes. Our approach has been trained only on synthetic videos of a single class (humans), yet it generalizes to real-world videos with a variety of other classes of articulated objects.
  • Figure 5: Emergent behavior: generalization to multiple instances. Trained purely on synthetic data with a single human in each video, our method generalizes zero-shot to real videos with multiple humans.
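To make the Figure 2 description concrete, the following is a minimal PyTorch sketch of the interface it implies: a pretrained video DiT backbone conditioned on a text embedding that selects the task, with learnable tokens appended to the patchified video latents to carry the sparse predictions, all in a single forward pass rather than iterative denoising. Every name here (THFMSketch, dit_backbone, keypoint_head, the token counts and dimensions) is an illustrative assumption, not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class THFMSketch(nn.Module):
    """Illustrative sketch only: a video DiT with learnable sparse-task tokens."""

    def __init__(self, dit_backbone: nn.Module, dim: int = 1024,
                 num_sparse_tokens: int = 17):
        super().__init__()
        self.dit = dit_backbone  # pretrained text-to-video DiT (assumed interface)
        # Learnable tokens appended to the video token sequence; their outputs
        # are read out as sparse predictions (e.g. keypoints).
        self.sparse_tokens = nn.Parameter(torch.zeros(1, num_sparse_tokens, dim))
        self.keypoint_head = nn.Linear(dim, 3)  # one 3d coordinate per token (assumed)

    def forward(self, video_latents: torch.Tensor, text_emb: torch.Tensor):
        # video_latents: (B, N, dim) patchified latents of the input video.
        # text_emb: prompt embedding that modulates the DiT and selects the
        # task (depth, normals, segmentation, dense pose, keypoints, ...).
        b, n, _ = video_latents.shape
        tokens = torch.cat(
            [video_latents, self.sparse_tokens.expand(b, -1, -1)], dim=1)
        # Single forward pass through the repurposed diffusion backbone.
        out = self.dit(tokens, context=text_emb)
        dense = out[:, :n]                        # decoded to the RGB ambient space
        sparse = self.keypoint_head(out[:, n:])   # (B, num_sparse_tokens, 3)
        return dense, sparse
```

Under these assumptions, a depth map and 3d keypoints would come from the same call, differing only in the text prompt passed as text_emb; the dense outputs would then be supervised in both the latent and RGB ambient spaces, as the Figure 2 caption describes.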