STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft

Nicholas Lenzen, Amogh Raut, Andrew Melnik

TL;DR

STEVE-Audio extends the STEVE-1 framework by enabling audio prompting in Minecraft through an Audio-Video CLIP foundation model and a CVAE prior that maps audio embeddings to the policy’s latent space. The approach keeps the pretrained audio and video encoders frozen and trains only the transformation networks on top of them to form a shared cross-modal latent space; a prior then translates audio embeddings into MineCLIP embeddings that condition the STEVE-1 policy. Empirical results show that audio prompting often outperforms visual and, in many cases, text prompting on short-horizon item-collection tasks, while also highlighting tradeoffs in ambiguity and prompt engineering across modalities. The work provides open-source training/evaluation code and a large Audio-Video Minecraft dataset, advancing multi-modal generalist sequential decision-making for embodied agents.

Abstract

Recently, the STEVE-1 approach has been introduced as a method for training generative agents to follow instructions in the form of latent CLIP embeddings. In this work, we present a methodology to extend the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain, and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent is able to perform on a comparable level to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and Audio-Video CLIP foundation model for Minecraft are made open-source to help foster further research into multi-modal generalist sequential decision-making agents.
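
The data flow described in the abstract (audio sample, then Audio-Video CLIP embedding, then prior network, then the policy's latent goal space) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random linear maps stand in for trained networks, every dimension is an assumed placeholder, and the names `audio_transform`, `prior_decoder`, and `audio_to_goal` are hypothetical. It only shows how a CVAE-style prior, conditioned on an audio embedding, could produce a goal embedding at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
AUDIO_FEAT_DIM = 768   # assumed width of the frozen audio encoder output
SHARED_DIM = 512       # assumed width of the Audio-Video CLIP space
GOAL_DIM = 512         # assumed width of the policy's latent goal space
LATENT_DIM = 64        # assumed width of the CVAE latent variable


def linear(in_dim, out_dim):
    """Return a random linear map standing in for a trained network."""
    w = rng.normal(scale=in_dim ** -0.5, size=(in_dim, out_dim))
    return lambda x: x @ w


def l2_normalize(x):
    """Scale each row to unit length, as is common for CLIP-style embeddings."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


# Hypothetical stand-ins for the trained components.
audio_transform = linear(AUDIO_FEAT_DIM, SHARED_DIM)
prior_decoder = linear(LATENT_DIM + SHARED_DIM, GOAL_DIM)


def audio_to_goal(audio_features):
    """Map frozen-encoder audio features to a goal embedding for the policy."""
    audio_emb = l2_normalize(audio_transform(audio_features))
    # At inference time a CVAE draws its latent from a standard normal prior
    # and decodes it together with the conditioning vector.
    z = rng.standard_normal((audio_emb.shape[0], LATENT_DIM))
    return prior_decoder(np.concatenate([z, audio_emb], axis=-1))


# Stand-in for the output of the frozen audio encoder on two audio clips.
audio_features = rng.standard_normal((2, AUDIO_FEAT_DIM))
goal = audio_to_goal(audio_features)
print(goal.shape)  # (2, 512)
```

In this sketch the policy itself is left out; the resulting `goal` array plays the role of the embedding that would condition it.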

Paper Structure

This paper contains 22 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Examples of evaluation tasks within the Minecraft environment, showcasing the observed ego-centric views as the agent works to achieve the corresponding objectives.
  • Figure 2: Our architecture for the Audio-Video CLIP model learns a shared latent space by jointly training the audio and video transformation networks, which are versions of the mapping network used by StyleGAN 3 (Karras et al., 2021). We use the frozen pretrained MineCLIP model (Fan et al., 2022) as the video encoder and the frozen pretrained Audio Spectrogram Transformer (Gong et al., 2021) as the audio encoder.
  • Figure 3: Our architecture for audio prompting of the STEVE-1 agent (Lifshitz et al., 2023).
  • Figure 4: Performance comparison of audio-conditioned STEVE-1 (created using our proposed methodology) with the original text-conditioned and visual-conditioned STEVE-1 agents. The last row consists of three evaluations, one per prompting modality, on "place" tasks whose audio prompts are ambiguous or underspecified (i.e., audio samples for placing dirt and sand sound very similar to the audio samples for digging dirt and sand). Results indicate that audio prompting fails to effectively condition the STEVE-1 policy in these ambiguous or underspecified scenarios. The black bars represent the 10th, 50th, and 90th percentiles, indicating performance spread across the different modalities.
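
The shared latent space in Figure 2 is learned from paired audio and video embeddings. A common way to train such a space is the symmetric contrastive objective used by CLIP-style models; the sketch below assumes that objective, since the captions above do not state the exact loss. The function name `clip_loss` and the temperature of 0.07 are illustrative choices, not details taken from the paper.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each array is a matched pair."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (batch, batch) scaled cosine similarities
    idx = np.arange(len(logits))
    # Cross-entropy with the matched pair as the correct class, in both directions.
    audio_to_video = -log_softmax(logits, axis=1)[idx, idx].mean()
    video_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (audio_to_video + video_to_audio)


rng = np.random.default_rng(0)
video = rng.standard_normal((8, 32))  # stand-in for transformed video embeddings
# Nearly identical pairs should score a much lower loss than mismatched pairs.
aligned = clip_loss(video + 0.01 * rng.standard_normal((8, 32)), video)
shuffled = clip_loss(np.roll(video, 1, axis=0), video)
print(aligned < shuffled)  # True
```

In a setup like the one in Figure 2, only the transformation networks that produce `audio_emb` and `video_emb` would receive gradients from this loss, while the pretrained encoders underneath stay frozen.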