'Studies for': A Human-AI Co-Creative Sound Artwork Using a Real-time Multi-channel Sound Generation Model

Chihiro Nagashima, Akira Takahashi, Zhi Zhong, Shusuke Takahashi, Yuki Mitsufuji

TL;DR

Studies for investigates how AI can be integrated into sound art to preserve and extend an artist's practice after a lifetime of work. SpecMaskGIT, a lightweight yet high-quality text-to-audio (T2A) model, is trained on over 200 hours of Evala's past works and conditioned on CLAP text-audio embeddings to generate eight-channel sound in real time, with an audio outpainting mechanism that keeps successive generations seamlessly continuous. The work demonstrates a practical Human-AI co-creation framework that incorporates feedback loops with the artist, and proposes a new form of archive that extends the artist's practice beyond death. It offers a blueprint for real-time, multi-channel generative sound installations and broadens the concept of archiving in sound art.

Abstract

This paper explores the integration of AI technologies into the artistic workflow through the creation of Studies for, a generative sound installation developed in collaboration with sound artist Evala (https://www.ntticc.or.jp/en/archive/works/studies-for/). The installation employs SpecMaskGIT, a lightweight yet high-quality sound generation AI model, to generate and play back eight-channel sound in real time, creating an immersive auditory experience over the course of a three-month exhibition. The work is grounded in the concept of a "new form of archive," which aims to preserve an artist's style while expanding beyond the artist's past artworks through the continued generation of new sound elements. This speculative approach to archival preservation is facilitated by training the AI model on a dataset consisting of over 200 hours of Evala's past sound artworks. By addressing key requirements in the co-creation of art using AI, this study highlights the value of the following aspects: (1) integrating artist feedback, (2) using datasets derived from an artist's past works, and (3) ensuring the inclusion of unexpected, novel outputs. In Studies for, the model was designed to reflect the artist's artistic identity while generating new, previously unheard sounds, making it a fitting realization of the concept of "a new form of archive." We propose a Human-AI co-creation framework for effectively incorporating sound generation AI models into the sound art creation process and suggest new possibilities for creating and archiving sound art that extend an artist's work beyond their physical existence. Demo page: https://sony.github.io/studies-for/

Paper Structure

This paper contains 14 sections, 1 equation, 6 figures, 1 table.

Figures (6)

  • Figure 1: Studies for, a collaborative generative sound installation created with sound artist Evala, exhibited at the NTT InterCommunication Center [ICC] in Tokyo from December 14, 2024, to March 9, 2025 (evala2024studiesfor). The installation space was enveloped in white fabric, with eight-channel speakers placed behind the fabric. The audience experienced the work by walking through the space. Photo by Maruo Ryuichi, courtesy of ICC.
  • Figure 2: Continuous Generation with Audio and Text Inputs.
  • Figure 3: Audio synthesis performance and number of synthesis iterations of different methods. The size of each circle represents the model size. SpecMaskGIT achieves decent quality with only 16 iterations and a small model size.
  • Figure 4: SpecVQGAN, which encodes non-overlapping 16-by-16 time-mel patches into discrete tokens and decodes the discrete tokens back to a Mel-spectrogram.
  • Figure 5: Self-supervised training of SpecMaskGIT. The Transformer is trained to reconstruct SpecVQGAN token sequences that are randomly masked with variable masking ratios, conditioned on a semantic embedding from the CLAP encoder. “M” denotes the learned mask token, while “C” denotes the proposed conditional mask.
  • ...and 1 more figure
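
As a rough illustration of the tokenization and masking described in the captions for Figures 4 and 5, the following Python sketch counts the tokens produced when a Mel-spectrogram is split into non-overlapping 16-by-16 patches, and masks a token sequence at a randomly drawn ratio. Only the patch size comes from the captions. The spectrogram dimensions, the `MASK_ID` placeholder, the uniform ratio distribution, and the function names `token_grid` and `random_mask` are assumptions made for illustration and are not taken from the paper.

```python
import random

PATCH = 16    # patch side length, from the Figure 4 caption
MASK_ID = -1  # hypothetical stand-in for the learned mask token "M"


def token_grid(n_mels, n_frames, patch=PATCH):
    """Return (rows, cols) of the grid of non-overlapping patch tokens."""
    if n_mels % patch or n_frames % patch:
        raise ValueError("dimensions must be multiples of the patch size")
    return n_mels // patch, n_frames // patch


def random_mask(tokens, rng, min_ratio=0.0, max_ratio=1.0):
    """Mask a random subset of tokens at a randomly drawn ratio.

    The ratio is drawn uniformly here, which is an assumption; the
    paper's actual masking schedule is not given in this section.
    Returns the masked copy and the sorted masked positions.
    """
    ratio = rng.uniform(min_ratio, max_ratio)
    # Mask at least one token, and never more than the sequence holds.
    n_mask = min(len(tokens), max(1, round(ratio * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


# Assumed example size: 80 mel bins by 160 frames.
rows, cols = token_grid(n_mels=80, n_frames=160)
tokens = list(range(rows * cols))  # dummy token ids
masked, positions = random_mask(tokens, random.Random(0))
print(f"{rows} x {cols} grid, {len(positions)} of {len(tokens)} tokens masked")
```

In a training setup of the kind Figure 5 describes, a model would be asked to predict the original ids at the masked positions; that step is omitted here because the section gives no details about it.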