Can We Estimate Purchase Intention Based on Zero-shot Speech Emotion Recognition?
Ryotaro Nagase, Takashi Sumiyoshi, Natsuo Yamashita, Kota Dohi, Yohei Kawaguchi
TL;DR
The paper tackles estimating purchase intention from speech in a zero-shot setting by extending the CLAP framework to multi-class and multi-task SER, allowing emotion categories to be defined via sentences. It introduces a multi-class multi-task CLAP and paraphrase-based data augmentation to broaden textual labels, enabling zero-shot estimation of bipolar emotions. Experiments on a Japanese dataset show that zero-shot estimates can match supervised models, especially with augmentation, highlighting the practical potential for detecting purchase intent from speech without task-specific labeling. This work advances SER toward flexible, unseen-emotion inference with direct applicability to sales and call-center analytics.
Abstract
This paper proposes a zero-shot speech emotion recognition (SER) method that estimates emotions not defined during SER model training. Conventional methods are limited to recognizing emotions defined by a single word. We are also motivated to recognize unknown bipolar emotions such as "I want to buy - I do not want to buy." To let classes be defined freely with sentences and to estimate unknown bipolar emotions, the proposed method extends the contrastive language-audio pre-training (CLAP) framework with multi-class and multi-task settings. We focus on purchase intention as a bipolar emotion and investigate how well the model estimates it in a zero-shot setting. This study is the first attempt to estimate purchase intention directly from speech. Experiments confirm that zero-shot estimation with the proposed method performs at the same level as a model trained by supervised learning.
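As a rough illustration of how a CLAP-style model can score sentence-defined classes without retraining, the sketch below compares one audio embedding against the text embeddings of the two poles of a bipolar emotion. It is a minimal sketch under stated assumptions, not the authors' implementation: the embedding vectors are made-up placeholders standing in for encoder outputs, the function names are hypothetical, and the temperature value is an assumed default rather than a setting taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_scores(audio_emb, text_embs, temperature=0.07):
    """Softmax over temperature-scaled audio-text similarities.

    `temperature` is an assumed value, not one reported in the paper.
    """
    logits = {label: cosine(audio_emb, emb) / temperature
              for label, emb in text_embs.items()}
    # Subtract the maximum logit for numerical stability.
    peak = max(logits.values())
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Placeholder embeddings (hypothetical); in practice these would come from
# the audio encoder and the text encoder of a trained CLAP-style model.
audio_emb = [0.9, 0.1, 0.2]
text_embs = {
    "I want to buy": [1.0, 0.0, 0.1],
    "I do not want to buy": [-1.0, 0.2, 0.0],
}

scores = zero_shot_scores(audio_emb, text_embs)
prediction = max(scores, key=scores.get)
print(prediction, scores)
```

Because the classes are plain sentences, a new bipolar emotion can be tried by replacing the two keys of `text_embs` with embeddings of different sentences, leaving the audio side unchanged.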
