Aligning Audio-Visual Joint Representations with an Agentic Workflow

Shentong Mo, Yibing Song

TL;DR

This paper improves audio-visual (AV) joint representations from a data-centric perspective by aligning audio signals to visual data, and reports state-of-the-art performance against previous baselines on diverse downstream tasks.

Abstract

Visual content and its accompanying audio signals naturally form a joint representation that benefits audio-visual (AV) applications. While prior studies develop various AV representation learning frameworks, the importance of AV data alignment for achieving high-quality representations is often underestimated. We observe that an audio signal may contain background noise interference, and that audio and video streams may be out of sync. Such loose data alignment limits representation quality and degrades application performance. In this paper, we propose to improve AV joint representations from a data-centric perspective by aligning audio signals to visual data. Our alignment is conducted in an agentic workflow controlled by an LLM-based assistant named AVAgent. For each input AV data pair, our AVAgent uses a multi-modal LLM to convert audio and visual data into language descriptions separately (i.e., tool use). Then, AVAgent reasons about whether the paired data is well aligned and plans to edit the audio signal if needed (i.e., planning). The audio editing is executed by predefined actions that filter noise or augment data. Moreover, we use a VLM to evaluate how well the modified audio signals match the visual content and to provide feedback to AVAgent (i.e., reflection). The tool use, planning, and reflection steps operate cyclically, forming an agentic workflow in which audio signals are gradually aligned to visual content. As a result, existing methods can directly leverage the AV data aligned by our agentic workflow to improve AV joint representations. Experimental results demonstrate state-of-the-art performance of the proposed approach against previous baselines on diverse downstream tasks.
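
The tool use, planning, and reflection cycle described in the abstract can be sketched as a small control loop. This is a minimal illustration, not the authors' implementation: the function `align_audio`, the `Feedback` record, the injected callables, the score threshold, and the cycle limit are all hypothetical names and values chosen for the example. Following the Figure 4 caption, each new plan is applied to the original audio, and reflection returns two scores.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Audio = List[float]   # stand-in for a waveform
Plan = List[str]      # names of editing actions, in order


@dataclass
class Feedback:
    plan: Plan
    scores: Tuple[float, float]   # two reflection scores (scale is an assumption)


def align_audio(
    audio: Audio,
    video: str,
    describe: Callable[[Audio, str], Tuple[str, str]],         # tool use (mLLM stand-in)
    plan_actions: Callable[[str, str, List[Feedback]], Plan],  # planning (LLM stand-in)
    apply_plan: Callable[[Audio, Plan], Audio],                # predefined editing actions
    reflect: Callable[[Audio, str], Tuple[float, float]],      # reflection (VLM stand-in)
    threshold: float = 0.5,   # hypothetical acceptance threshold
    max_cycles: int = 3,      # hypothetical cycle limit
) -> Tuple[Audio, List[Feedback]]:
    """Run the cyclic workflow and return the best edited audio plus the feedback log."""
    history: List[Feedback] = []
    best_audio, best_score = audio, float("-inf")
    # Tool use: convert audio and video into language descriptions separately.
    audio_text, video_text = describe(audio, video)
    for _ in range(max_cycles):
        # Planning: decide which edits (if any) are needed, given past feedback.
        plan = plan_actions(audio_text, video_text, history)
        if not plan:
            break  # the pair is judged to be aligned already
        # Each plan is applied to the ORIGINAL audio, not to the previous edit.
        edited = apply_plan(audio, plan)
        # Reflection: score how well the edited audio matches the visual content.
        scores = reflect(edited, video_text)
        history.append(Feedback(plan, scores))
        if min(scores) > best_score:
            best_audio, best_score = edited, min(scores)
        if min(scores) >= threshold:
            break
    return best_audio, history


# Toy stand-ins so the loop can be run end to end.
def toy_describe(audio, video):
    return f"audio with {len(audio)} samples", f"video: {video}"

def toy_plan(audio_text, video_text, history):
    return ["halve"] if not history else ["double"]

def toy_apply(audio, plan):
    gain = {"halve": 0.5, "double": 2.0}
    out = list(audio)
    for name in plan:
        out = [x * gain[name] for x in out]
    return out

def toy_reflect(edited, video_text):
    score = min(max(abs(x) for x in edited) / 2.0, 1.0)
    return score, score

aligned, history = align_audio(
    [0.0, 1.0, -1.0], "a drummer", toy_describe, toy_plan, toy_apply, toy_reflect
)
```

In this toy run the first plan scores poorly, its feedback is passed back to the planner, and the second plan is accepted. In a real system the four callables would wrap the multi-modal LLM, the planning LLM, the audio editing actions, and the VLM.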

Paper Structure

This paper contains 28 sections, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: A glimpse of the AVAgent workflow. Three steps (tool use, planning, and reflection) form a cyclic agentic workflow where audio signals are progressively aligned with the visual content for joint representation improvement.
  • Figure 2: Overview of the AVAgent framework. Tool use: For each audio-visual data pair, we employ a multi-modal Large Language Model (LLM) to convert the audio and visual data into language descriptions separately. Planning: The agent reads the text descriptions of the AV data and plans edits to the audio signal that improve alignment. Reflection: A Vision-Language Model (VLM) then evaluates the modifications to check that the edited audio matches the visual content, and provides feedback to the agent. These steps form a cyclic agentic workflow where audio signals are progressively aligned with the visual content for enhanced joint representation.
  • Figure 3: Audio editing action illustrations. We design 8 actions to edit audio signals for AV alignment. The first 4 actions are set to reduce background noise interference, and the last 4 actions are set to coordinate audio signals to visual data. Our AVAgent plans to use these actions according to input AV data pairs.
  • Figure 4: An example of our agentic workflow. For one input AV pair, we use mLLMs to transform video and audio data into language descriptions, separately. Then, AVAgent reasons and plans for actions. After editing the audio, AVAgent performs a reflection to compute two scores. As these scores are relatively low, they are sent to AVAgent for consideration in the next cycle. The newly planned actions operate on the original input AV pair and achieve favorable scores in reflection. These actions are then identified for editing input audio signals.
  • Figure 5: Illustration of a full example (Lecture in a Large Hall) of our agent workflow.
  • ...and 2 more figures
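
The Figure 3 caption describes a fixed set of editing actions in two groups (noise reduction and audio-to-visual coordination) from which the agent builds a plan. One way such a registry and plan executor might look is sketched below. The captions here do not name the eight actions, so the four operations shown (`noise_gate`, `remove_dc_offset`, `delay_one`, `half_gain`) are hypothetical stand-ins, as are `EditAction`, `ACTIONS`, and `apply_plan`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Audio = List[float]   # stand-in for a waveform


@dataclass(frozen=True)
class EditAction:
    group: str                       # "noise" or "coordination"
    fn: Callable[[Audio], Audio]


# Hypothetical noise-reduction stand-ins.
def noise_gate(audio: Audio, floor: float = 0.1) -> Audio:
    """Zero out samples whose magnitude is below the floor."""
    return [0.0 if abs(x) < floor else x for x in audio]

def remove_dc_offset(audio: Audio) -> Audio:
    """Subtract the mean so the signal is centred on zero."""
    mean = sum(audio) / len(audio) if audio else 0.0
    return [x - mean for x in audio]

# Hypothetical coordination stand-ins.
def delay_one(audio: Audio) -> Audio:
    """Shift the signal one sample later, padding the start with silence."""
    return [0.0] + audio[:-1] if audio else []

def half_gain(audio: Audio) -> Audio:
    """Halve the volume."""
    return [0.5 * x for x in audio]


ACTIONS: Dict[str, EditAction] = {
    "noise_gate": EditAction("noise", noise_gate),
    "remove_dc_offset": EditAction("noise", remove_dc_offset),
    "delay_one": EditAction("coordination", delay_one),
    "half_gain": EditAction("coordination", half_gain),
}


def apply_plan(original: Audio, plan: List[str]) -> Audio:
    """Apply the planned actions in order to a copy of the original audio."""
    unknown = [name for name in plan if name not in ACTIONS]
    if unknown:
        raise KeyError(f"unknown actions in plan: {unknown}")
    edited = list(original)   # the input itself is left untouched
    for name in plan:
        edited = ACTIONS[name].fn(edited)
    return edited


original = [0.05, 1.0, -0.4]
edited = apply_plan(original, ["noise_gate", "delay_one"])
```

Because `apply_plan` works on a copy, a later cycle can apply a different plan to the same original input, which matches the Figure 4 caption's note that newly planned actions operate on the original AV pair.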