LLM-Guided Reinforcement Learning for Audio-Visual Speech Enhancement

Chih-Ning Chen; Jen-Cheng Hou; Hsin-Min Wang; Shao-Yi Chien; Yu Tsao; Fan-Gang Zeng

Abstract

Existing Audio-Visual Speech Enhancement (AVSE) methods are commonly trained with objectives such as Scale-Invariant Signal-to-Noise Ratio (SI-SNR) and Mean Squared Error (MSE); however, these objectives often correlate poorly with perceptual quality and offer limited interpretability for optimization. This work proposes a reinforcement learning-based AVSE framework with an interpretable reward model built on a Large Language Model (LLM). An audio LLM generates natural language descriptions of the enhanced speech, and a sentiment analysis model converts each description into a rating on a 1-5 scale, which serves as the reward for fine-tuning a pretrained AVSE model with Proximal Policy Optimization (PPO). Compared with scalar metrics, LLM-generated feedback is semantically rich and explicitly describes improvements in speech quality. Experiments on the 4th COG-MHEAR AVSE Challenge (AVSEC-4) dataset show that the proposed method outperforms both a supervised baseline and a DNSMOS-based RL baseline in PESQ, STOI, neural quality metrics, and subjective listening tests.
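The reward path described in the abstract (text description, then sentiment model, then a 1-5 rating) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `toy_sentiment` is a keyword-counting stand-in for a real sentiment analysis model, the description strings stand in for audio LLM output, and taking the argmax over five ordered classes is an assumption about how class scores might be turned into a rating.

```python
from typing import Callable, List, Sequence


def rating_from_probs(probs: Sequence[float]) -> int:
    """Map probabilities over five ordered classes to a 1-5 rating.

    Assumption: the rating is the most probable class (argmax).
    """
    if len(probs) != 5:
        raise ValueError("expected exactly five class probabilities")
    best_index = max(range(5), key=lambda i: probs[i])
    return best_index + 1


def toy_sentiment(text: str) -> List[float]:
    """Hypothetical stand-in for a sentiment model (keyword counting only)."""
    positive = {"clear", "clean", "natural", "intelligible"}
    negative = {"noisy", "distorted", "muffled", "unclear"}
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    # Shift the neutral class (index 2) by the keyword score, clamped to 0..4.
    index = max(0, min(4, 2 + score))
    probs = [0.05] * 5
    probs[index] = 0.8
    return probs


def reward_from_description(
    description: str,
    sentiment_model: Callable[[str], Sequence[float]] = toy_sentiment,
) -> float:
    """Convert a text description of enhanced speech into a scalar reward."""
    return float(rating_from_probs(sentiment_model(description)))


if __name__ == "__main__":
    for text in (
        "The speech is clear and natural.",
        "The speech is noisy and distorted.",
    ):
        print(text, "->", reward_from_description(text))
```

In an actual training loop, the scalar returned by `reward_from_description` would play the role of the per-utterance reward passed to the PPO update; swapping `toy_sentiment` for a real classifier only requires a callable that returns five class probabilities.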

Paper Structure (16 sections, 12 equations, 4 figures, 1 table)


Introduction
Methodology
The AVSE Model
SE as a Reinforcement Learning Problem
Reward Model Design
Experimental Setup
Dataset
Training
Baselines
Results
Objective Quality Metrics
Subjective Evaluation
Interpretable Reward Analysis
Discussion
Conclusion
...and 1 more sections

Figures (4)

Figure 1: The training procedure of the proposed LR-AVSE framework.
Figure 2: Pipeline of the LLM-based interpretable reward generation.
Figure 3: LR-AVSE inference with SALMONN. Rewards from textual descriptions align with PESQ and STOI, demonstrating LR-AVSE’s interpretability.
Figure 4: A/B preference test results on the AVSEC-4 test set.
