
ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction

Wenxuan Wu, Shuai Wang, Xixin Wu, Helen Meng, Haizhou Li

TL;DR

ELEGANCE addresses the semantic blindness of AV-TSE models by injecting linguistic knowledge from large language models (LLMs) during training. It introduces three plug-and-play guidance strategies (linguistic constraints, linguistic prediction, and linguistic prior) that align textual and speech representations and reinforce extraction without adding inference cost. Across two AV-TSE backbones (USEV and AV-Mamba) and multiple LLMs (RoBERTa, Qwen3-0.6B, Qwen3-4B), the approach improves performance in scenarios with impaired visual cues, multiple languages, target speaker switches, and out-of-domain data, with larger gains from bigger autoregressive (AR) LLMs. The results demonstrate robust cross-lingual transfer, improved resilience to interference, and practical applicability to low-resource languages, making linguistic guidance a viable augmentation for multimodal speech extraction.

Abstract

Audio-visual target speaker extraction (AV-TSE) models primarily rely on visual cues from the target speaker. However, humans also leverage linguistic knowledge, such as syntactic constraints, next-word prediction, and prior knowledge of the conversation, to extract target speech. Inspired by this observation, we propose ELEGANCE, a novel framework that incorporates linguistic knowledge from large language models (LLMs) into AV-TSE models through three distinct guidance strategies: output linguistic constraints, intermediate linguistic prediction, and input linguistic prior. Comprehensive experiments with RoBERTa, Qwen3-0.6B, and Qwen3-4B on two AV-TSE backbones demonstrate the effectiveness of our approach. Significant improvements are observed in challenging scenarios, including impaired visual cues, unseen languages, target speaker switches, an increased number of interfering speakers, and an out-of-domain test set. Demo page: https://alexwxwu.github.io/ELEGANCE/.

Paper Structure

This paper contains 39 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Three LLM guidance strategies for AV-TSE, from left to right. (a) Output guidance: AV-TSE with linguistic constraints, where both a PLM and a PSLM are used during training, with an adapter to align the latent semantic space. (b) Intermediate guidance: AV-TSE with linguistic prediction. A decoder-only LLM (Qwen) is shown as an example, but the strategy also applies to encoder-only LLMs such as RoBERTa; with a causal architecture and a cascade transformer decoder, the fusion method is the same. (c) Input guidance: AV-TSE with linguistic prior. The LLM guidance and a zero embedding are added in an interleaved way to reduce the AV-TSE model's over-reliance on transcripts, which are not available during inference. In all three strategies, the modules inside the gray dotted lines are used only during training and are dropped at inference. The blue dotted lines denote the trajectory of linguistic knowledge injection.
  • Figure 2: Examples of normal and impaired visual-cue scenarios.
  • Figure 3: SI-SDR-i results on five monolingual test sets. USEV-EN and USEV-I-Roberta-EN denote the baseline and the baseline with the input guidance strategy using RoBERTa-base, respectively. Both are trained on the English monolingual training set with clean visual cues.
  • Figure 4: Mel-spectrograms of USEV and AV-Mamba extraction results using the three proposed strategies with RoBERTa.
  • Figure 5: Mel-spectrograms of USEV extraction results with the combination of the three proposed strategies with RoBERTa, Qwen3-0.6B, and Qwen3-4B.
  • ...and 1 more figures
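
Figure 3 reports results as SI-SDR-i (scale-invariant signal-to-distortion ratio improvement). As a minimal sketch of how this metric is commonly computed, the Python below measures the SI-SDR of the extracted speech against the clean target and subtracts the SI-SDR of the unprocessed mixture. The function names, the `eps` stabilizer, and the toy sine-wave signals are illustrative assumptions; this is not the authors' evaluation code.

```python
import math


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length signals."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")

    # Remove the mean so a DC offset does not affect the score.
    est_mean = sum(estimate) / len(estimate)
    tgt_mean = sum(target) / len(target)
    est = [x - est_mean for x in estimate]
    tgt = [x - tgt_mean for x in target]

    # Project the estimate onto the target; the residual is the distortion.
    dot = sum(e * t for e, t in zip(est, tgt))
    target_energy = sum(t * t for t in tgt)
    scale = dot / (target_energy + eps)
    projection = [scale * t for t in tgt]
    residual = [e - p for e, p in zip(est, projection)]

    signal_energy = sum(p * p for p in projection)
    noise_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDR-i: gain of the extracted signal over the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)


if __name__ == "__main__":
    # Toy signals (assumption): two sine waves stand in for two speakers.
    n = 800
    target = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
    interferer = [math.sin(2 * math.pi * 13 * i / n) for i in range(n)]
    mixture = [t + v for t, v in zip(target, interferer)]
    # A hypothetical extractor that attenuates the interferer by a factor of 10.
    estimate = [t + 0.1 * v for t, v in zip(target, interferer)]
    print(f"SI-SDR-i: {si_sdr_improvement(estimate, mixture, target):.2f} dB")
```

In this toy case the interferer's amplitude drops by a factor of 10, so the improvement comes out at about 20 dB. Because the estimate is projected onto the target before the energies are compared, rescaling the estimate leaves the score essentially unchanged, which is the point of the scale-invariant variant.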