LLM2Fx-Tools: Tool Calling For Music Post-Production

Seungheon Doh, Junghyun Koo, Marco A. Martínez-Ramírez, Woosung Choi, Wei-Hsiang Liao, Qiyu Wu, Juhan Nam, Yuki Mitsufuji

TL;DR

The paper introduces LLM2Fx-Tools, a multimodal, tool-calling framework that uses chain-of-thought planning to generate executable audio-effect chains for music post-production. It presents LP-Fx, a large dataset of 101K conversational examples with structured tool calls, CoT, and responses to train and evaluate the system. By bridging an audio encoder and an audio-language adapter with a fine-tuned LLM, the approach achieves strong performance in Fx-chain planning, parameter estimation, and style-transfer tasks, while enabling interpretable, natural-language explanations and tool-based execution. Analyses including LLM-as-a-judge corroborate high tool-calling accuracy and reasoning quality, suggesting practical potential for controllable music production workflows and future expansion to richer plugin ecosystems.

Abstract

This paper introduces LLM2Fx-Tools, a multimodal tool-calling framework that generates executable sequences of audio effects (Fx-chain) for music post-production. LLM2Fx-Tools uses a large language model (LLM) to understand audio inputs, select audio effects types, determine their order, and estimate parameters, guided by chain-of-thought (CoT) planning. We also present LP-Fx, a new instruction-following dataset with structured CoT annotations and tool calls for audio effects modules. Experiments show that LLM2Fx-Tools can infer an Fx-chain and its parameters from pairs of unprocessed and processed audio, enabled by autoregressive sequence modeling, tool calling, and CoT reasoning. We further validate the system in a style transfer setting, where audio effects information is transferred from a reference source and applied to new content. Finally, LLM-as-a-judge evaluation demonstrates that our approach generates appropriate CoT reasoning and responses for music production queries. To our knowledge, this is the first work to apply LLM-based tool calling to audio effects modules, enabling interpretable and controllable music production.
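To make the idea of an executable Fx-chain concrete, the following is a minimal sketch of how a list of tool calls emitted by a model might be run in order against audio samples. It is not the paper's implementation: the tool names (`gain`, `hard_clip`), their parameters, the JSON layout, and the `apply_fx_chain` helper are all hypothetical and chosen only for illustration.

```python
import json

# Hypothetical audio effect "tools"; the real system uses its own effect modules.
def gain(samples, gain_db):
    """Scale samples by a gain given in decibels."""
    factor = 10.0 ** (gain_db / 20.0)
    return [s * factor for s in samples]

def hard_clip(samples, threshold):
    """Limit every sample to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, s)) for s in samples]

# Registry mapping tool names to callables (assumed naming scheme).
TOOLS = {"gain": gain, "hard_clip": hard_clip}

def apply_fx_chain(samples, tool_calls):
    """Apply each tool call in order; the order of the list is the Fx-chain order."""
    for call in tool_calls:
        fx = TOOLS[call["name"]]
        samples = fx(samples, **call["arguments"])
    return samples

# Assumed shape of a model's structured tool-calling output.
model_output = (
    '[{"name": "gain", "arguments": {"gain_db": 6.0}},'
    ' {"name": "hard_clip", "arguments": {"threshold": 0.8}}]'
)
fx_chain = json.loads(model_output)

dry = [0.0, 0.25, -0.5, 0.9]
wet = apply_fx_chain(dry, fx_chain)
```

Because each call consumes the output of the previous one, swapping the two entries would generally give a different result, which is why the model has to decide effect order as well as effect types and parameters.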

Paper Structure

This paper contains 27 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: An illustration of the LLM2Fx-Tools framework. The input to LLM2Fx-Tools consists of an instruction, the available tools, reference audio, and (pseudo) dry audio that is preprocessed with audio effects removal and normalization (Fx-Removal and Fx-Norm). The framework outputs a chain of thought, a tool-calling procedure, and a response. The generated tool-calling outputs (Fx-chain) are then combined with tool environments (audio effects modules) to transform new audio in the style of the reference audio.
  • Figure 2: Model Architecture
  • Figure 3: Data generation process for LP-Fx
  • Figure 4: Subjective evaluation on reverse engineering.