Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation

Chun-Yi Kuan, Chih-Kai Yang, Wei-Ping Huang, Ke-Han Lu, Hung-yi Lee

TL;DR

Speech-Copilot is a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction and achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks.

Abstract

In this work, we introduce Speech-Copilot, a modular framework for instruction-oriented speech-processing tasks that minimizes human effort in toolset construction. Unlike end-to-end methods using large audio-language models, Speech-Copilot builds speech processing-specific toolsets by analyzing pre-collected task instructions and breaking tasks into manageable sub-tasks. It features a flexible agent based on large language models that performs tasks through program generation. Our approach achieves state-of-the-art performance on the Dynamic-SUPERB benchmark, demonstrating its effectiveness across diverse speech-processing tasks. Key contributions include: 1) developing an innovative framework for speech processing-specific toolset construction, 2) establishing a high-performing agent based on large language models, and 3) offering a new perspective on addressing challenging instruction-oriented speech-processing tasks. Without additional training processes required by end-to-end approaches, our method provides a flexible and extendable solution for a wide range of speech-processing applications.

Paper Structure (22 sections, 1 equation, 2 figures, 4 tables)

This paper contains 22 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Overview of Speech-Copilot, showing the toolset construction and program generation phases. During toolset construction, task decomposition first breaks diverse speech-processing task instructions into fundamental sub-tasks. Task modularization then uses an LLM to turn the sub-tasks into documented modules, which are implemented manually with scientifically grounded models. Finally, in the program generation phase, an LLM generates a program from the user query, and the program is executed on the audio input to produce the result. Please refer to the demo page for more details about the prompts.
  • Figure 2: The results of Speech-Copilot on multi-task examples.
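
The program generation phase described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Module` structure, the prompt wording, the `fake_llm` stand-in, and the stubbed `speech_recognition` module are all hypothetical, chosen only to show how documented modules, a user query, and a generated program might fit together.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Module:
    """A documented speech-processing module (hypothetical structure)."""
    name: str
    doc: str
    fn: Callable[..., object]


def build_prompt(toolset: Dict[str, Module], query: str) -> str:
    """Combine the module documentation and the user query into one prompt."""
    docs = "\n".join(f"- {m.name}: {m.doc}" for m in toolset.values())
    return (
        f"Available modules:\n{docs}\n\n"
        f"Query: {query}\n"
        "Write Python that stores the answer in `result`."
    )


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a fixed program for illustration."""
    return (
        "text = speech_recognition(audio)\n"
        "result = 'yes' if 'hello' in text else 'no'\n"
    )


def run_program(program: str, toolset: Dict[str, Module], audio: object) -> object:
    """Execute the generated program with the toolset and the audio in scope."""
    scope = {name: m.fn for name, m in toolset.items()}
    scope["audio"] = audio
    exec(program, scope)  # assumption: a real system would sandbox this step
    return scope["result"]


# Usage: a one-module toolset whose recognizer is a stub, not a real model.
toolset = {
    "speech_recognition": Module(
        name="speech_recognition",
        doc="Transcribe the audio input to text.",
        fn=lambda audio: "hello world",
    )
}
prompt = build_prompt(toolset, "Does the speaker say hello?")
answer = run_program(fake_llm(prompt), toolset, audio=b"")
```

In this sketch, extending the system means adding another `Module` entry to the toolset; nothing is retrained, which mirrors the training-free, extendable design the abstract describes.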