Audio Matters Too! Enhancing Markerless Motion Capture with Audio Signals for String Performance Capture

Yitong Jin, Zhiping Qiu, Yi Shi, Shuangpeng Sun, Chongwu Wang, Donghao Pan, Jiachen Zhao, Zhenghao Liang, Yuan Wang, Xiaobing Li, Feng Yu, Tao Yu, Qionghai Dai

TL;DR

This work tackles markerless 3D motion capture for string instrument performance, where subtle finger and string interactions are hard to capture visually. It introduces the String Performance Dataset (SPD), a large-scale multi-view, multi-modal dataset featuring cello and violin performances with synchronized audio and detailed motion annotations. The authors propose an audio-guided multi-modal MoCap framework that uses pitch information from audio to constrain and correct visually estimated hand poses, notably improving the accuracy of the note-playing finger and the contact relationship with the instrument. Ablation studies and qualitative results demonstrate the method's effectiveness, suggesting that audio cues can significantly augment visual MoCap in delicate instrument-performance scenarios. SPD establishes a new benchmark for multimodal string performance data, enabling future work in pedagogy, animation, and music-driven motion synthesis.

Abstract

In this paper, we address markerless multi-modal human motion capture for string performance, which involves inherently subtle hand-string contacts and intricate movements. To this end, we first collect the String Performance Dataset (SPD), featuring cello and violin performances. The dataset includes videos captured from up to 23 different views, audio signals, and detailed 3D motion annotations of the body, hands, instrument, and bow. To acquire these detailed motion annotations, we propose an audio-guided multi-modal motion capture framework that explicitly incorporates hand-string contacts detected from the audio signals when solving for detailed hand poses. This framework serves as a baseline for string performance capture in a completely markerless manner: it imposes no external devices on performers and so avoids the risk of distorting such delicate movements. We argue that the movements of performers, particularly the sound-producing gestures, contain subtle information that is often elusive to visual methods but can be inferred from audio cues. Consequently, we refine the vision-based motion capture results with our audio-guided approach, which also clarifies the contact relationship between the performer and the instrument as deduced from the audio. We validate the proposed framework and conduct ablation studies to demonstrate its efficacy. Our results outperform current state-of-the-art vision-based algorithms, underscoring the feasibility of augmenting visual motion capture with the audio modality. To the best of our knowledge, SPD is the first large-scale, multi-modal dataset for musical instrument performance that covers fine-grained hand motion details.
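The idea of using audio to correct a visually estimated hand pose can be illustrated with a small sketch. It is not the authors' framework. It assumes an ideal string (frequency inversely proportional to vibrating length), a known pitch and excited string, and 3D positions for the nut, the bridge, and the fingertips. The function names `contact_point` and `refine_fingertips`, the `weight` parameter, and all coordinates are hypothetical.

```python
import math

def contact_point(nut, bridge, open_hz, pitch_hz):
    """3D point where an ideal string must be stopped to sound pitch_hz.

    Assumption: frequency is inversely proportional to vibrating length,
    and the vibrating part runs from the bridge to the stopping finger.
    """
    if pitch_hz < open_hz:
        raise ValueError("a pitch below the open string cannot be played on it")
    ratio = open_hz / pitch_hz  # vibrating fraction of the full string length
    return tuple(b + ratio * (n - b) for n, b in zip(nut, bridge))

def refine_fingertips(fingertips, target, weight=1.0):
    """Pull the fingertip closest to the target toward it.

    `weight` (hypothetical) blends between the visual estimate (0.0)
    and the audio-derived contact point (1.0). Other fingers are unchanged.
    """
    playing = min(fingertips, key=lambda name: math.dist(fingertips[name], target))
    refined = dict(fingertips)
    refined[playing] = tuple(
        p + weight * (t - p) for p, t in zip(fingertips[playing], target)
    )
    return playing, refined

# Hypothetical example: a 0.69 m string tuned to D3 (146.83 Hz) sounding A3 (220 Hz).
nut, bridge = (0.0, 0.0, 0.0), (0.0, 0.69, 0.0)
target = contact_point(nut, bridge, open_hz=146.83, pitch_hz=220.0)
visual_estimate = {
    "index": (0.01, 0.15, 0.00),
    "middle": (0.00, 0.22, 0.01),
    "ring": (0.00, 0.27, 0.01),
    "pinky": (0.00, 0.31, 0.01),
}
playing, refined = refine_fingertips(visual_estimate, target)
```

In this toy setup the fingertip nearest to the audio-derived contact point is treated as the note-playing finger and moved onto the string. An open-string note (ratio of 1) needs no stopping finger and would be skipped in practice.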

Paper Structure (11 sections, 3 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: The pipeline combines the information extracted from the visual and auditory inputs. The following sections explain each part in detail.
  • Figure 2: Our recording setup. The cello and violin scenarios differ slightly in the number and positions of cameras: 20 cameras for the cello and 23 for the violin.
  • Figure 3: Illustration of cello keypoints.
  • Figure 4: Using the cello as a reference, we present examples of various note-playing finger positions alongside their corresponding vibrating lengths (highlighted in red) and pitch values. (a) represents an open string note, producing the lowest pitch achievable on the excited string. (b) and (c) demonstrate different finger positions on the same string, resulting in different pitches. (c) and (d) illustrate instances in which varying finger positions on different strings produce the same pitch.
  • Figure 5: Demonstration of our results through reprojection and 3D visualization from different views.
  • ...and 2 more figures
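The relationship between finger position, vibrating length, and pitch described in the Figure 4 caption can be sketched numerically. The snippet below is a minimal illustration under stated assumptions, not code from the paper: it treats each string as ideal (frequency inversely proportional to vibrating length), uses standard cello tuning (C2, G2, D3, A3), and assumes a scale length of 0.69 m. The names `vibrating_length` and `candidate_positions` are hypothetical.

```python
# Standard cello open-string frequencies in Hz (equal temperament, A4 = 440 Hz).
CELLO_OPEN_HZ = {"C": 65.41, "G": 98.00, "D": 146.83, "A": 220.00}
SCALE_LENGTH_M = 0.69  # assumed nut-to-bridge length; varies by instrument

def vibrating_length(pitch_hz, open_hz, scale_length=SCALE_LENGTH_M):
    """Vibrating length (m) that sounds pitch_hz on an ideal string."""
    if pitch_hz < open_hz:
        raise ValueError("the open string is the lowest pitch on that string")
    return scale_length * open_hz / pitch_hz

def candidate_positions(pitch_hz):
    """Map each string that can sound pitch_hz to its vibrating length."""
    return {
        name: vibrating_length(pitch_hz, open_hz)
        for name, open_hz in CELLO_OPEN_HZ.items()
        if pitch_hz >= open_hz
    }

# A3 (220 Hz) is the open A string, but can also be stopped on lower strings.
for name, length in candidate_positions(220.0).items():
    print(f"{name} string: vibrating length {length:.3f} m")
```

A single pitch therefore maps to several string and position candidates, as in panels (c) and (d) of Figure 4, so pitch alone does not fix the finger position: the excited string must also be known.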