Continuous-Time Human Motion Field from Events

Ziyun Wang, Ruijun Zhang, Zi-Yan Liu, Yufu Wang, Kostas Daniilidis

TL;DR

This work introduces EvHuman, the first method to predict a continuous-time human motion field directly from event streams by leveraging a neural motion prior and a time-continuous decoder, enabling pose queries at arbitrary timestamps with parallel inference. It combines a GRU-based event predictor, NeMF-based motion priors, and a differentiable, event-contrastive supervision signal to jointly estimate local SMPL poses and global motion, while avoiding the computational bottlenecks of discrete-pose optimization. The authors demonstrate superior accuracy and significantly faster inference than prior event-based methods on MMHPSD and the new BEAHM dataset, which provides hardware-synchronized, high-frame-rate ground truth at 120 FPS. The BEAHM dataset, along with the proposed losses and training scheme, enables robust evaluation of high-speed human motion under various lighting conditions and motions, highlighting EvHuman’s practical impact for real-time, high-fidelity motion capture from events.

Abstract

This paper addresses the challenges of estimating a continuous-time human motion field from a stream of events. Existing Human Mesh Recovery (HMR) methods rely predominantly on frame-based approaches, which are prone to aliasing and inaccuracies due to limited temporal resolution and motion blur. In this work, we predict a continuous-time human motion field directly from events by leveraging a recurrent feed-forward neural network to predict human motion in the latent space of possible human motions. Prior state-of-the-art event-based methods rely on computationally intensive optimization across a fixed number of poses at high frame rates, which becomes prohibitively expensive as the temporal resolution increases. In comparison, we present the first work that replaces traditional discrete-time predictions with a continuous human motion field represented as a time-implicit function, enabling parallel pose queries at arbitrary temporal resolutions. Despite the promise of event cameras, few benchmarks have tested the limits of high-speed human motion estimation. We introduce the Beam-splitter Event Agile Human Motion Dataset (BEAHM), a hardware-synchronized high-speed human dataset, to fill this gap. On this new data, our method improves joint errors by 23.8% compared to previous event-based human methods while reducing the computational time by 69%.
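To make the idea of a time-implicit function concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small MLP that maps a latent motion code and a sinusoidally encoded timestamp to a 72-dimensional pose vector (24 joints times 3 axis-angle values, the standard SMPL pose size). The names `query_motion_field` and `time_encoding`, the layer sizes, and the random weights are all hypothetical stand-ins for a trained decoder. The point it illustrates is that the same latent code can be queried at any set of timestamps in one batched pass.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 16   # hypothetical size of the motion latent code
NUM_FREQS = 4     # hypothetical number of sinusoidal time frequencies
HIDDEN = 32       # hypothetical hidden width
POSE_DIM = 72     # 24 SMPL joints x 3 axis-angle values


def time_encoding(t):
    """Map timestamps of shape (T,) to sinusoidal features of shape (T, 2 * NUM_FREQS)."""
    freqs = (2.0 ** np.arange(NUM_FREQS)) * np.pi
    angles = t[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


# Randomly initialised weights stand in for a trained decoder (assumption).
W1 = rng.normal(scale=0.1, size=(LATENT_DIM + 2 * NUM_FREQS, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, POSE_DIM))
b2 = np.zeros(POSE_DIM)


def query_motion_field(z, t):
    """Evaluate the field at every timestamp in t with one batched matrix pass."""
    t = np.asarray(t, dtype=np.float64)
    # Repeat the latent code for each timestamp and append the time features.
    feats = np.concatenate([np.tile(z, (t.shape[0], 1)), time_encoding(t)], axis=1)
    hidden = np.tanh(feats @ W1 + b1)
    return hidden @ W2 + b2


z = rng.normal(size=LATENT_DIM)                              # one latent code per clip
coarse = query_motion_field(z, np.linspace(0.0, 1.0, 31))    # 31 samples over the clip
fine = query_motion_field(z, np.linspace(0.0, 1.0, 1001))    # 1001 samples, same code
```

Because time is an input rather than an index, increasing the temporal resolution only changes the number of rows in the batch; no per-pose optimization variables are added.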

Paper Structure

This paper contains 16 sections, 17 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: EvHuman predicts a set of global and local latent codes from an event stream to represent continuous-time human motions. The latent codes are decoded by a neural human motion prior in a time-continuous MLP network that can be queried efficiently, in parallel, at any time resolution. The human sequences on the right are decoded from a test event stream in MMHPSD zou2021eventhpe.
  • Figure 2: Pipeline: EvHuman takes in a continuous stream of events. The input event volumes are passed into a set of shared encoders. The encoded motion-rich features are further processed with a temporal aggregation network that iteratively refines a hidden state $h_0$. We apply two linear projections to map the terminal hidden state to the global and motion latent codes. The latent codes are decoded with a pre-trained neural motion prior decoder into an MLP network that predicts SMPL parameters and global translation at any time.
  • Figure 3: We present four example sequences from the data collection of EvHuman. Each sequence, from left to right, includes: (a-d) Four multi-camera images with bounding boxes and skeleton estimations via EasyMocap easymocap. (e) Events displayed on the beam splitter RGB camera. (f) The estimated mesh model superimposed on the beam splitter RGB camera.
  • Figure 4: Continuous-time decoding compared to interpolation. Left and right: start and end poses. Middle: full joint trajectory. Interpolated keypoints are marked in red and continuous poses in blue.
  • Figure 5: Human Event Contrast Maximization. Left: Raw event IWE. Middle: Motion-compensated events using estimated human motion. Right: Dense motion field from our continuous-time human motion field. Color indicates direction of optical flow.
  • ...and 2 more figures
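
The Figure 5 caption refers to contrast maximization over an image of warped events (IWE). The following is a minimal sketch of the general technique, not the paper's loss: it assumes synthetic events from a single moving edge and swaps in one constant flow vector where the paper uses a dense motion field derived from the estimated human motion. The helper names `image_of_warped_events` and `contrast` are hypothetical. Events are transported back to a reference time along the candidate motion and accumulated; a correct motion estimate aligns them into a sharp image with higher variance.

```python
import numpy as np


def image_of_warped_events(xs, ys, ts, flow, shape, t_ref=0.0):
    """Accumulate events after moving them to t_ref along a constant flow in px/s."""
    wx = np.rint(xs - flow[0] * (ts - t_ref)).astype(int)
    wy = np.rint(ys - flow[1] * (ts - t_ref)).astype(int)
    # Drop events that leave the image after warping.
    inside = (wx >= 0) & (wx < shape[1]) & (wy >= 0) & (wy < shape[0])
    iwe = np.zeros(shape)
    # np.add.at accumulates correctly when several events hit the same pixel.
    np.add.at(iwe, (wy[inside], wx[inside]), 1.0)
    return iwe


def contrast(iwe):
    """Variance of the IWE, a common contrast objective."""
    return iwe.var()


# Synthetic events (assumption): a vertical edge starting at x = 10 px,
# moving right at 20 px/s for one second on a 32 x 48 sensor.
rng = np.random.default_rng(0)
n = 2000
ts = rng.uniform(0.0, 1.0, n)
ys = rng.integers(0, 32, n).astype(float)
xs = 10.0 + 20.0 * ts
true_flow = (20.0, 0.0)

raw = image_of_warped_events(xs, ys, ts, (0.0, 0.0), (32, 48))      # no compensation
compensated = image_of_warped_events(xs, ys, ts, true_flow, (32, 48))

print(contrast(raw) < contrast(compensated))
```

With the true flow, every event collapses back onto the edge's starting column, so the compensated image is far sharper than the raw accumulation. A differentiable variant of this objective would use bilinear splatting instead of rounding to integer pixels.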