audio2chart: End to End Audio Transcription into playable Guitar Hero charts
Riccardo Tripodi
TL;DR
The paper addresses automatic generation of Guitar Hero–style charts directly from audio. It formulates the task as an autoregressive sequence model with time-discretized tokens and audio conditioning, optimizing $P(y_t \mid y_{<t}, \mathbf{c})$ and leveraging an Encodec-based audio encoder to provide cross-attentive multimodal context. Key contributions include a time-discretized tokenization scheme with 63 token configurations, a strong unconditional baseline, and a scalable audio-conditioned Transformer architecture that outperforms the baseline on non-pad predictions, with 40 ms granularity offering the best trade-off between accuracy and efficiency. The work provides open-source code and pretrained models, enabling reproducible research and practical automatic chart generation from diverse audio inputs.
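For reference, the per-step conditional above corresponds to the usual chain-rule factorization of the whole chart sequence. The sequence length $T$ and the negative log-likelihood loss $\mathcal{L}$ written below are assumptions based on standard autoregressive training, not details confirmed by this summary; the paper's actual objective may, for example, mask or reweight pad tokens.

$$
P(y_{1:T} \mid \mathbf{c}) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, \mathbf{c}),
\qquad
\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, \mathbf{c})
$$

Under this reading, the unconditional baseline corresponds to dropping $\mathbf{c}$ and modeling $P(y_t \mid y_{<t})$ alone.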
Abstract
This work introduces audio2chart, a framework for the automatic generation of Guitar Hero-style charts directly from raw audio. The task is formalized as a sequence prediction problem in which models are trained to generate discrete chart tokens aligned with the audio at discrete time steps. An unconditional baseline demonstrates strong predictive performance, while the addition of audio conditioning yields consistent improvements across accuracy-based metrics. This work demonstrates that incorporating audio conditioning is both feasible and effective for improving note prediction in automatic chart generation. The complete codebase for training and inference is publicly available on GitHub, supporting reproducible research on neural chart generation. A family of pretrained models is released on Hugging Face.
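To make "discrete chart tokens aligned with the audio on discrete time steps" concrete, here is a minimal sketch of placing note events on a fixed 40 ms grid. It is not the audio2chart implementation: the function `quantize_events`, the pad id `PAD_ID = 0`, the example token ids, and the choice to floor onsets to their containing step are all illustrative assumptions, and the paper's 63-configuration vocabulary is not reproduced here.

```python
import math

STEP_MS = 40  # grid resolution; 40 ms is the granularity highlighted in the summary
PAD_ID = 0    # hypothetical id meaning "no note at this step"


def quantize_events(events, duration_ms, step_ms=STEP_MS, pad_id=PAD_ID):
    """Place (onset_ms, token_id) events on a fixed grid, padding empty steps.

    Assumption: each onset is floored to the step that contains it, and a
    later event overwrites an earlier one that lands on the same step.
    """
    n_steps = math.ceil(duration_ms / step_ms)
    grid = [pad_id] * n_steps
    for onset_ms, token_id in events:
        idx = int(onset_ms // step_ms)
        if 0 <= idx < n_steps:
            grid[idx] = token_id
    return grid


# Three hypothetical note events in a 200 ms excerpt (token ids are arbitrary).
events = [(0, 5), (95, 12), (130, 3)]
grid = quantize_events(events, duration_ms=200)
print(grid)
```

With a 40 ms step, most positions in a real chart would likely hold the pad token, which may be why results are reported separately for non-pad predictions.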
