MSMT-FN: Multi-segment Multi-task Fusion Network for Marketing Audio Classification

HongYu Liu, Ruijie Wan, Yueju Han, Junxin Li, Liuxing Lu, Chao He, Lihua Cai

TL;DR

This work tackles marketing audio classification to predict purchase propensity from long Mandarin conversations. It introduces MSMT-FN, a multi-segment, multi-task fusion network that treats text as the backbone and enriches it with acoustic cues via cross-attention, bottleneck fusion, and Bi-GRU contextual modeling. A new MarketCalls dataset is curated and released upon request, and the method is evaluated across MarketCalls and standard benchmarks (CMU-MOSI, CMU-MOSEI, MELD) with code provided on GitHub. Results show MSMT-FN achieves strong performance and robust generalization, with ablations confirming the importance of audio augmentation, silence preservation, bottleneck fusion, and multi-task learning for business-relevant, cross-language audio analysis.

Abstract

Audio classification plays an essential role in sentiment analysis and emotion recognition, especially for analyzing customer attitudes in marketing phone calls. Efficiently categorizing customer purchasing propensity from large volumes of audio data remains challenging. In this work, we propose a novel Multi-Segment Multi-Task Fusion Network (MSMT-FN) designed specifically to address this business demand. Evaluations on our proprietary MarketCalls dataset, as well as on established benchmarks (CMU-MOSI, CMU-MOSEI, and MELD), show that MSMT-FN consistently outperforms or matches state-of-the-art methods. Additionally, our newly curated MarketCalls dataset will be available upon request, and the code base is accessible at the MSMT-FN GitHub repository, to facilitate further research and advances in the audio classification domain.

Paper Structure

This paper contains 22 sections, 8 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: MSMT-FN Network Architecture. Step 1: Preprocessing. Each audio recording is broken into segments, and the audio and textual channels are extracted and encoded to obtain embeddings for both channels. Step 2: Feature Fusion Layer. The textual channel serves as the backbone and is passed independently through self-attention blocks, while a separate complementary channel is created by fusing the audio channel with the text channel using cross- and self-attention blocks. Step 3: Bottleneck Fusion Layer. A bottleneck fusion mechanism fuses the two channels from Step 2 more effectively. Finally, a BiGRU processes all segments of one recording under a multi-task learning framework to generate classification predictions for the different tasks.
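
The data flow described in the Figure 1 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the weights are random stand-ins for trained parameters, the embeddings are assumed to be already computed (Step 1 is skipped), and the dimensions, the number of bottleneck tokens, the direction of the cross-attention (text queries attending to audio), the mean pooling, the simplified collect-and-read-back bottleneck scheme, and the task names `task_a` and `task_b` are all hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init(*shape):
    # Small random weights stand in for trained parameters.
    return rng.normal(scale=0.1, size=shape)

def attention(query, context, w):
    """Single-head scaled dot-product attention; w has shape (3, d, d)."""
    q, k, v = query @ w[0], context @ w[1], context @ w[2]
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def fuse_segment(text, audio, p):
    """Steps 2-3 for one segment: text (n_t, d), audio (n_a, d) -> (2d,)."""
    # Step 2: backbone channel = self-attention over text only.
    backbone = attention(text, text, p["text_self"])
    # Step 2: complementary channel; text queries attend to audio frames
    # (the direction is an assumption), followed by self-attention.
    comp = attention(text, audio, p["cross"])
    comp = attention(comp, comp, p["comp_self"])
    # Step 3 (simplified): a few bottleneck tokens collect from both
    # channels, and each channel then reads back only from those tokens.
    both = np.concatenate([backbone, comp], axis=0)
    tokens = attention(p["bottleneck"], both, p["collect"])
    backbone = backbone + attention(backbone, tokens, p["to_text"])
    comp = comp + attention(comp, tokens, p["to_comp"])
    # Mean-pool each channel into one vector per segment (assumption).
    return np.concatenate([backbone.mean(axis=0), comp.mean(axis=0)])

def gru(xs, w, u, b):
    """Plain GRU over xs (T, d_in); returns hidden states (T, h)."""
    h = np.zeros(u.shape[-1])
    states = []
    for x in xs:
        z = sigmoid(x @ w[0] + h @ u[0] + b[0])        # update gate
        r = sigmoid(x @ w[1] + h @ u[1] + b[1])        # reset gate
        n = np.tanh(x @ w[2] + (r * h) @ u[2] + b[2])  # candidate state
        h = (1 - z) * n + z * h
        states.append(h)
    return np.stack(states)

def predict(segments, p):
    """segments: list of (text, audio) embedding pairs from one recording."""
    seq = np.stack([fuse_segment(t, a, p) for t, a in segments])
    fwd = gru(seq, *p["gru_fwd"])        # left-to-right pass over segments
    bwd = gru(seq[::-1], *p["gru_bwd"])  # right-to-left pass over segments
    # The final state of each direction summarizes the whole recording.
    summary = np.concatenate([fwd[-1], bwd[-1]])
    # One linear head per task on the shared summary (multi-task setup).
    return {name: softmax(summary @ w) for name, w in p["heads"].items()}

d, h = 16, 8  # hypothetical embedding and GRU hidden sizes
params = {k: init(3, d, d) for k in
          ["text_self", "cross", "comp_self", "collect", "to_text", "to_comp"]}
params["bottleneck"] = init(4, d)  # 4 bottleneck tokens (hypothetical)
params["gru_fwd"] = (init(3, 2 * d, h), init(3, h, h), np.zeros((3, h)))
params["gru_bwd"] = (init(3, 2 * d, h), init(3, h, h), np.zeros((3, h)))
params["heads"] = {"task_a": init(2 * h, 4), "task_b": init(2 * h, 2)}

# Three segments with differing text-token and audio-frame counts.
segments = [(init(n_t, d), init(n_a, d))
            for n_t, n_a in [(5, 20), (7, 30), (3, 12)]]
probs = predict(segments, params)
print({name: dist.shape for name, dist in probs.items()})
```

Because every segment is reduced to a fixed-size vector before the recurrent layer, recordings with different numbers of segments, and segments with different lengths, pass through the same parameters. The per-task heads share everything up to the recording summary, which is the part a multi-task objective would train jointly.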