FinAudio: A Benchmark for Audio Large Language Models in Financial Applications
Yupeng Cao, Haohang Li, Yangyang Yu, Shashidhar Reddy Javaji, Yueru He, Jimin Huang, Qianqian Xie, Fabrizio Dimino, Xiao-yang Liu, K. P. Subbalakshmi, Meikang Qiu, Sophia Ananiadou, Jian-Yun Nie
TL;DR
FinAudio is the first AudioLLM benchmark tailored to financial audio. It defines three practical tasks (short-clip ASR, long-recording ASR, and long-form summarization) and curates datasets totaling over 400 hours. An evaluation of seven AudioLLMs shows that short clips are easier to transcribe than long-form audio and that ASR quality strongly influences summarization accuracy. The open-source Whisper-v3 often outperforms closed models on ASR, while long-form audio remains a major challenge. The study also analyzes prompt robustness and error categories, emphasizing the need for domain adaptation. It points to concrete directions for improving numerical recognition, financial terminology handling, and context length in AudioLLMs, and the authors plan to release all datasets and code to accelerate progress in finance-focused audio intelligence.
Abstract
Audio Large Language Models (AudioLLMs) have received widespread attention and have significantly improved performance on audio tasks such as conversation, audio understanding, and automatic speech recognition (ASR). Despite these advances, no benchmark exists for assessing AudioLLMs in financial scenarios, where audio data such as earnings conference calls and CEO speeches are crucial resources for financial analysis and investment decisions. In this paper, we introduce FinAudio, the first benchmark designed to evaluate the capacity of AudioLLMs in the financial domain. We first define three tasks based on the unique characteristics of the financial domain: 1) ASR for short financial audio, 2) ASR for long financial audio, and 3) summarization of long financial audio. We then curate two short-audio and two long-audio datasets and develop a novel dataset for financial audio summarization; together these make up the FinAudio benchmark. Finally, we evaluate seven prevalent AudioLLMs on FinAudio. Our evaluation reveals the limitations of existing AudioLLMs in the financial domain and offers insights for improving them. All datasets and code will be released.
