Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation

Shiqi Yang, Zhi Zhong, Mengjie Zhao, Shusuke Takahashi, Masato Ishii, Takashi Shibuya, Yuki Mitsufuji

TL;DR

This work tackles multi-modal audio-visual generation by introducing a lightweight, non-autoregressive Transformer that operates directly on discrete VQGAN tokens. Trained with mask denoising, the model supports image2audio, audio2image, and co-generation, and can leverage classifier-free guidance without additional training. On VGGSound, it achieves strong image2audio performance, often surpassing diffusion-based baselines, while offering faster inference and a simpler training pipeline. Overall, the approach provides a practical and scalable baseline for cross-modal synthesis and highlights the potential of masked transformers in audio-visual generation.

Abstract

In recent years, diffusion-based generative models have attracted considerable attention in both visual and audio generation, owing to their realistic results and wide range of personalized applications. Compared with the considerable advances in text2image and text2audio generation, research on audio2visual and visual2audio generation has progressed relatively slowly. Recent audio-visual generation methods usually resort to huge large language models or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back and show that a simple, lightweight generative transformer, which has not been fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space and is trained in a mask denoising manner. After training, classifier-free guidance can be deployed off-the-shelf to achieve better performance, without any extra training or modification. Since the transformer model is modality symmetrical, it can also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ/
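
To make the classifier-free guidance step concrete, here is a minimal sketch of how conditional and unconditional token logits are commonly blended at inference time. The function name `apply_cfg` and the blending formula are assumptions: this is the widely used form of classifier-free guidance, chosen so that a guidance factor of 1 reduces to the plain conditional prediction, and it is not taken from the paper's own equations.

```python
def apply_cfg(cond_logits, uncond_logits, guidance_factor):
    """Blend conditional and unconditional logits for one token position.

    Assumed formula: uncond + w * (cond - uncond).
    With w == 1 this returns the conditional logits unchanged.
    """
    return [
        u + guidance_factor * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


# Hypothetical logits over a 3-entry codebook.
cond = [2.0, 0.5, -1.0]    # predicted with the conditioning modality present
uncond = [1.0, 0.5, 0.0]   # predicted with the conditioning modality removed

print(apply_cfg(cond, uncond, 1.0))  # same as cond: guidance is effectively off
print(apply_cfg(cond, uncond, 3.0))  # pushed further away from uncond
```

Because the blend only touches the logits at sampling time, a sketch like this needs no change to the trained weights, which is in line with the abstract's statement that guidance is applied without extra training or modification.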

Paper Structure (18 sections, 2 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: During training, we randomly mask tokens and also drop 50% of the tokens; a dropped token may or may not be masked. Before the transformer decoder, both the dropped positions (even those not originally masked) and the masked positions are padded with a learnable embedding.
  • Figure 2: Conditional generation pipeline by iteratively unmasking during inference.
  • Figure 3: (Top) Ablation study on the mask ratio during training; no classifier-free guidance (CFG) is used in training or inference. $T$ and $N$ in the figure denote the temperature and the number of sampling iterations during inference. (Bottom) Ablation study on the CFG guidance factor, where a guidance factor of 1 means CFG is not applied.
  • Figure 4: Ablation study on different temperatures ($T$) and numbers of sampling iterations ($N$) during inference. The x and y axes denote $T$ and $N$, respectively. A larger circle indicates better performance on the specific metric. The model here is trained with the mask ratio sampled from a truncated Gaussian with mean 0.55, and no CFG is used in training or inference.
  • Figure 5: Image and audio mel spectrogram reconstruction during training.
  • ...and 5 more figures
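
The training-time masking described in the Figure 1 and Figure 4 captions can be sketched as follows. This is a minimal illustration only: the captions give the 50% drop rate and the truncated-Gaussian mean of 0.55, while the standard deviation, the truncation bounds, the function names, and the `MASK` placeholder (a stand-in for the learnable embedding) are all assumptions.

```python
import random

MASK = "<mask>"  # placeholder standing in for the learnable embedding


def sample_mask_ratio(rng, mean=0.55, std=0.25, low=0.0, high=1.0):
    """Rejection-sample a mask ratio from a Gaussian truncated to [low, high].

    Only mean=0.55 comes from the captions; std and the bounds are assumed.
    """
    while True:
        ratio = rng.gauss(mean, std)
        if low <= ratio <= high:
            return ratio


def mask_and_drop(tokens, mask_ratio, rng, drop_ratio=0.5):
    """Pick masked and dropped positions independently, then pad both.

    Returns the padded sequence fed to the decoder plus both position sets.
    """
    n = len(tokens)
    masked = set(rng.sample(range(n), round(mask_ratio * n)))
    # Dropping ignores the mask, so a dropped token may or may not be masked.
    dropped = set(rng.sample(range(n), round(drop_ratio * n)))
    hidden = masked | dropped
    padded = [MASK if i in hidden else tok for i, tok in enumerate(tokens)]
    return padded, masked, dropped


rng = random.Random(0)
tokens = list(range(100, 116))  # 16 hypothetical VQGAN token ids
ratio = sample_mask_ratio(rng)
padded, masked, dropped = mask_and_drop(tokens, ratio, rng)
print(f"mask ratio {ratio:.2f}, {padded.count(MASK)} of {len(tokens)} positions padded")
```

Sampling the two position sets independently is one reading of the Figure 1 caption; the paper's own procedure for deciding which tokens the encoder still sees may differ.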