Towards Video to Piano Music Generation with Chain-of-Perform Support Benchmarks
Chang Liu, Haomin Zhang, Shiyu Xia, Zihao Chen, Chaofan Ding, Xin Yue, Huizhe Chen, Xinhan Di
TL;DR
The paper addresses video-guided piano music generation by introducing the Chain-of-Perform (CoP) Benchmark Dataset, a fully open-source suite that explicitly links visual inputs to piano output through step-by-step thinking and generation stages. It contributes a 10-hour multimodal dataset recorded with a five-view piano setup, together with a two-stage pipeline that separates reasoning from generation to enable fine-grained semantic and temporal alignment. The authors define an Evaluation Metrics Suite comprising Thinking Metrics and Music Generation Metrics, which assess reasoning accuracy and piano audio fidelity respectively, including piano-specific MIDI precision and perceptual scores. The CoP benchmark aims to accelerate progress in high-fidelity, temporally synchronized video-to-piano synthesis by providing a structured, extensible resource for evaluating V2A/V2M models and guiding future research.
Abstract
Generating high-quality piano audio from video requires precise synchronization between visual cues and musical output, ensuring accurate semantic and temporal alignment. However, existing evaluation datasets do not fully capture the intricate synchronization required for piano music generation. A comprehensive benchmark is essential for two primary reasons: (1) existing metrics fail to reflect the complexity of video-to-piano music interactions, and (2) a dedicated benchmark dataset can provide valuable insights to accelerate progress in high-quality piano music generation. To address these challenges, we introduce the CoP Benchmark Dataset, a fully open-source, multimodal benchmark designed specifically for video-guided piano music generation. The proposed Chain-of-Perform (CoP) benchmark offers three main features: (1) detailed multimodal annotations, enabling precise semantic and temporal alignment between video content and piano audio via step-by-step Chain-of-Perform guidance; (2) a versatile evaluation framework for rigorous assessment of both general-purpose and specialized video-to-piano generation tasks; and (3) full open-sourcing of the dataset, annotations, and evaluation protocols. The dataset is publicly available at https://github.com/acappemin/Video-to-Audio-and-Piano, with a continuously updated leaderboard to promote ongoing research in this domain.
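
To make the notion of a piano-specific MIDI precision score more concrete, the following is a minimal sketch of one way note-level precision, recall, and F1 could be computed between a reference and a generated note list. It is not the benchmark's actual metric definition, which is not specified in this summary. The function name `note_precision_recall_f1`, the `(pitch, onset)` note representation, the greedy matching strategy, and the 50 ms onset tolerance (a common convention in note-level transcription evaluation) are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical note representation: (MIDI pitch number, onset time in seconds).
Note = Tuple[int, float]


def note_precision_recall_f1(
    reference: List[Note],
    generated: List[Note],
    onset_tolerance: float = 0.05,  # assumed 50 ms tolerance
) -> Tuple[float, float, float]:
    """Greedy note matching: a generated note counts as correct if an
    unmatched reference note has the same pitch and an onset within
    `onset_tolerance` seconds. Each reference note is matched at most once."""
    matched_ref = set()
    true_positives = 0

    # Process generated notes in onset order.
    for pitch, onset in sorted(generated, key=lambda n: n[1]):
        best_index = None
        best_gap = None
        for i, (ref_pitch, ref_onset) in enumerate(reference):
            if i in matched_ref or ref_pitch != pitch:
                continue
            gap = abs(ref_onset - onset)
            # Keep the closest unmatched reference note within tolerance.
            if gap <= onset_tolerance and (best_gap is None or gap < best_gap):
                best_index, best_gap = i, gap
        if best_index is not None:
            matched_ref.add(best_index)
            true_positives += 1

    precision = true_positives / len(generated) if generated else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

In this sketch, precision penalizes extra or mistimed generated notes, while recall penalizes reference notes that the model failed to produce. A stricter variant could also compare note offsets or velocities, and an optimal (rather than greedy) assignment could be substituted if matching ambiguity matters.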
