STEVE-Audio: Expanding the Goal Conditioning Modalities of Embodied Agents in Minecraft

Nicholas Lenzen; Amogh Raut; Andrew Melnik

TL;DR

STEVE-Audio extends the STEVE-1 framework to support audio prompting in Minecraft, using an Audio-Video CLIP foundation model and a CVAE prior that maps audio embeddings into the policy’s latent goal space. The audio and video encoders are kept frozen, and transformation networks on top of them are trained to form a shared cross-modal latent space; the prior then translates audio embeddings into MineCLIP embeddings that condition the STEVE-1 policy. Empirical results show that audio prompting often outperforms visual prompting and, in many cases, text prompting on short-horizon item-collection tasks, while also highlighting tradeoffs in ambiguity and prompt engineering across modalities. The work provides open-source training and evaluation code and a large Audio-Video Minecraft dataset, advancing multi-modal generalist sequential decision-making for embodied agents.

Abstract

The STEVE-1 approach was recently introduced as a method for training generative agents to follow instructions given as latent CLIP embeddings. In this work, we present a methodology for extending the control modalities by learning a mapping from new input modalities to the latent goal space of the agent. We apply our approach to the challenging Minecraft domain and extend the goal conditioning to include the audio modality. The resulting audio-conditioned agent performs at a level comparable to the original text-conditioned and visual-conditioned agents. Specifically, we create an Audio-Video CLIP foundation model for Minecraft and an audio prior network, which together map audio samples to the latent goal space of the STEVE-1 policy. Additionally, we highlight the tradeoffs that occur when conditioning on different modalities. Our training code, evaluation code, and Audio-Video CLIP foundation model for Minecraft are made open-source to help foster further research into multi-modal generalist sequential decision-making agents.
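
To make the idea of a shared audio-video latent space concrete, the following is a minimal sketch of a symmetric CLIP-style contrastive loss over paired audio and video embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function names, the pure-Python list representation, and the temperature of 0.07 (a value commonly used in CLIP-style training) are all hypothetical choices, and the exact loss used for the Audio-Video CLIP model is not stated here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(audio_embs, video_embs, temperature=0.07):
    """Pairwise cosine similarities divided by a temperature.

    Entry [i][j] compares audio sample i with video sample j.
    The temperature value is an assumption for illustration.
    """
    audio = [l2_normalize(v) for v in audio_embs]
    video = [l2_normalize(v) for v in video_embs]
    return [
        [sum(a * b for a, b in zip(ai, vj)) / temperature for vj in video]
        for ai in audio
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, video_embs, temperature=0.07):
    """Average of the audio-to-video and video-to-audio cross-entropies."""
    logits = cosine_logits(audio_embs, video_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed)
    )


# Toy batch: pair i of audio and video embeddings should match.
audio_batch = [[1.0, 0.0], [0.0, 1.0]]
video_matched = [[1.0, 0.0], [0.0, 1.0]]
video_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(audio_batch, video_matched)
loss_swapped = symmetric_contrastive_loss(audio_batch, video_swapped)
```

In this toy batch the correctly paired embeddings yield a much lower loss than the swapped ones, which is the signal a contrastive objective of this kind would use to pull matching audio and video clips together in the shared space. In a setup like the one described, the embeddings would come from trainable transformation networks applied to frozen encoder outputs, so only those networks would receive gradient updates.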
