VioPose: Violin Performance 4D Pose Estimation by Hierarchical Audiovisual Inference

Seong Jong Yoo, Snehesh Shrestha, Irina Muresanu, Cornelia Fermüller

TL;DR

This work proposes VioPose, a novel multimodal network that hierarchically estimates dynamics. Its architecture is shown to produce accurate pose sequences, which facilitates precise motion analysis, and it outperforms SoTA.

Abstract

Musicians delicately control their bodies to generate music. Sometimes, their motions are too subtle to be captured by the human eye. To analyze how they move to produce the music, we need to estimate precise 4D human pose (3D pose over time). However, current state-of-the-art (SoTA) visual pose estimation algorithms struggle to produce accurate monocular 4D poses because of occlusions, partial views, and human-object interactions. They are limited by the viewing angle, pixel density, and sampling rate of the cameras and fail to estimate fast and subtle movements, such as in the musical effect of vibrato. We leverage the direct causal relationship between the music produced and the human motions creating it to address these challenges. We propose VioPose: a novel multimodal network that hierarchically estimates dynamics. High-level features are cascaded to low-level features and integrated into Bayesian updates. Our architecture is shown to produce accurate pose sequences, facilitating precise motion analysis, and it outperforms SoTA. As part of this work, we collected the largest and most diverse calibrated violin-playing dataset, including video, sound, and 3D motion capture poses. Code and dataset can be found on our project page: https://sj-yoo.info/viopose/

Paper Structure

This paper contains 25 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: 4D pose estimation in a violin performance, which features both fine-grained motion (left-hand vibrato, $\approx$ 10 mm perturbation) and large motion (right-hand bowing). VioPose successfully estimates both motions, while other approaches fail. See Figs. 4 and 5 for detailed real experimental results.
  • Figure 2: Overview of our proposed VioPose, a multimodal hierarchical 4D pose estimation pipeline, which receives 2D keypoints computed by an off-the-shelf algorithm and corresponding music-playing audio. Black and purple solid circles represent concatenation and averaging operations, respectively. The architecture consists of three main components: the single modality encoder (green and orange box, $\S$\ref{sec:single_modality}), the hierarchy module (blue box, $\S$\ref{sec:hierarchy}), and the mixing module (purple box, $\S$\ref{sec:mixing}).
  • Figure 3: Our dataset was recorded with smartphones from 4 different camera views (left figure), with video at $\approx$ 30 FPS and synchronized audio. It includes a total of 12 people who differ in gender, age, height, violin size, and body type (right figure).
  • Figure 4: Predicted right wrist trajectories (red line) and the ground truth 3D trajectories (green line) after Procrustes projection for better comparison. Each graph contains 90 frames (3 seconds).
  • Figure 5: Predicted left-hand trajectories (red line) and the ground truth 3D trajectories (green line) in the vibrato movement after center alignment for better comparison. Most of the SoTA methods estimate simple straight lines, but VioPose is able to estimate the fine vibrato motion. Note that MHFormer appears to estimate vibrato, but its estimates contain high jitter. We can verify this from the trajectories in Fig. 4, or from the MPJVE and MPJAE metrics in Table \ref{tab:result_human3d_pose}.
  • ...and 1 more figure
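
The captions above compare trajectories after Procrustes projection and point to the MPJVE and MPJAE metrics as evidence of jitter. The following is a minimal sketch of how such quantities are commonly computed, not the authors' evaluation code. It assumes poses are stored as NumPy arrays of shape `(T, J, 3)` (frames, joints, coordinates), that MPJVE and MPJAE denote mean per-joint velocity and acceleration errors taken over first and second frame-to-frame differences, and that the Procrustes step is a similarity alignment (scale, rotation, translation). The function names `procrustes_align`, `mpjve`, and `mpjae` are hypothetical.

```python
import numpy as np


def procrustes_align(pred, gt):
    """Align pred (N, 3) to gt (N, 3) with a similarity transform.

    Assumption: "Procrustes projection" means the least-squares optimal
    scale, rotation, and translation mapping pred onto gt.
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # SVD of the 3x3 cross-covariance gives the optimal rotation.
    u, s, vt = np.linalg.svd(p.T @ g)
    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    fix = np.diag([1.0, 1.0, d])
    rot = u @ fix @ vt
    scale = (s * np.diag(fix)).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_g


def mpjve(pred, gt):
    """Mean per-joint velocity error for (T, J, 3) sequences.

    Velocities are first differences between consecutive frames, so the
    result is in position units per frame (no division by the frame time).
    """
    v_pred = np.diff(pred, n=1, axis=0)
    v_gt = np.diff(gt, n=1, axis=0)
    return np.linalg.norm(v_pred - v_gt, axis=-1).mean()


def mpjae(pred, gt):
    """Mean per-joint acceleration error, using second differences over frames."""
    a_pred = np.diff(pred, n=2, axis=0)
    a_gt = np.diff(gt, n=2, axis=0)
    return np.linalg.norm(a_pred - a_gt, axis=-1).mean()
```

Because both metrics are built from frame-to-frame differences, a constant positional offset does not affect them, while frame-level jitter inflates them. That is why they complement a position-only error when judging whether an estimated vibrato is genuine motion or noise.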