audio2chart: End to End Audio Transcription into playable Guitar Hero charts

Riccardo Tripodi

TL;DR

The paper addresses automatic generation of Guitar Hero–style charts directly from audio. It formulates the task as an autoregressive sequence model with time-discretized tokens and audio conditioning, optimizing $P(y_t \mid y_{<t}, \mathbf{c})$ and leveraging an Encodec-based audio encoder to provide cross-attentive multimodal context. Key contributions include a time-discretized tokenization scheme with 63 token configurations, a strong unconditional baseline, and a scalable audio-conditioned Transformer architecture that outperforms the baseline on non-pad predictions, with 40 ms granularity offering the best trade-off between accuracy and efficiency. The work provides open-source code and pretrained models, enabling reproducible research and practical automatic chart generation from diverse audio inputs.
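Written out, the autoregressive formulation above is the usual chain-rule factorization of the chart sequence given the audio context. The sequence length $T$ and the loss symbol $\mathcal{L}$ are notation introduced here for illustration, and the negative log-likelihood is the standard training objective for such models, assumed rather than quoted from the paper:

$$
P(y_{1:T} \mid \mathbf{c}) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, \mathbf{c}),
\qquad
\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, \mathbf{c})
$$

Dropping $\mathbf{c}$ from both expressions gives the unconditional baseline, which predicts each token from the preceding chart tokens alone.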

Abstract

This work introduces audio2chart, a framework for the automatic generation of Guitar Hero-style charts directly from raw audio. The task is formalized as a sequence prediction problem, where models are trained to generate discrete chart tokens aligned with the audio on discrete time steps. An unconditional baseline demonstrates strong predictive performance, while the addition of audio conditioning yields consistent improvements across accuracy-based metrics, showing that audio conditioning is both feasible and effective for improving note prediction in automatic chart generation. The complete codebase for training and inference is publicly available on GitHub, supporting reproducible research on neural chart generation. A family of pretrained models is released on Hugging Face.
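To make "discrete chart tokens aligned with the audio on discrete time steps" concrete, the following is a minimal sketch of one way to place note events on a fixed time grid. It is not the paper's tokenizer: the function name `discretize`, the `PAD` marker, the integer lane indices, and the millisecond units are all assumptions made for illustration, and the 40 ms default simply reuses the granularity mentioned in the TL;DR.

```python
import math
from collections import defaultdict

PAD = "<pad>"  # hypothetical marker for a time step that contains no note


def discretize(events, step_ms=40.0, duration_ms=None):
    """Map (timestamp_ms, lane) note events onto a fixed time grid.

    Returns one entry per grid step: a sorted tuple of the lanes that
    fall inside that step, or PAD when the step is empty.
    """
    by_step = defaultdict(set)
    for t_ms, lane in events:
        # Notes whose timestamps fall in the same bin are merged into one chord.
        by_step[int(t_ms // step_ms)].add(lane)

    # Cover at least up to the last note; optionally pad to the audio length.
    n_steps = max(by_step, default=-1) + 1
    if duration_ms is not None:
        n_steps = max(n_steps, math.ceil(duration_ms / step_ms))

    return [
        tuple(sorted(by_step[i])) if i in by_step else PAD
        for i in range(n_steps)
    ]


# Two simultaneous notes at 0 ms, then single notes at 95 ms and 130 ms.
events = [(0, 0), (0, 2), (95, 1), (130, 4)]
tokens = discretize(events)
print(tokens)  # [(0, 2), '<pad>', (1,), (4,)]
```

In this sketch a coarser grid shortens the sequence but merges notes that are closer together than one step, while a finer grid preserves timing at the cost of more pad steps. This is the kind of accuracy versus efficiency trade-off the TL;DR attributes to the choice of granularity.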

Paper Structure

This paper contains 14 sections, 6 equations, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: Example of a dataset sample. Each gameplay song is represented by its audio track and a sequence of note events, each annotated with additional attributes such as note duration, HOPO, star power, and other in-game mechanics. For simplicity, the timestamp is shown here as an integer, but in practice it can be any real number, since a note can be placed at any time. Notes that occur at the same time are represented as separate rows with the same timestamp.
  • Figure 2: Cumulative distribution function of the time difference between consecutive notes for each difficulty level. Zoom on the first 0.5 seconds.
  • Figure 3: Baseline performance in terms of perplexity and prediction accuracy as a function of the maximum context length. The x-axis corresponds to 128, 256, 512 and 1024 tokens, which roughly correspond to 15 s, 30 s, 60 s, and the full song for the Expert difficulty level.
  • Figure 4: Overview of the proposed multimodal architecture. Audio is encoded in 30 ms frames and processed through a pretrained Encodec encoder followed by a lightweight adapter. The resulting continuous representations are fused with the symbolic token sequence via cross-attention in a Transformer decoder.
  • Figure 5: Bin density of the number of notes per song for each difficulty level.
  • ...and 5 more figures