FinAudio: A Benchmark for Audio Large Language Models in Financial Applications

Yupeng Cao, Haohang Li, Yangyang Yu, Shashidhar Reddy Javaji, Yueru He, Jimin Huang, Qianqian Xie, Fabrizio Dimino, Xiao-yang Liu, K. P. Subbalakshmi, Meikang Qiu, Sophia Ananiadou, Jian-Yun Nie

TL;DR

FinAudio is the first AudioLLM benchmark tailored to financial audio. It defines three practical tasks (short-clip ASR, long-recording ASR, and long-form summarization) and curates datasets totaling over 400 hours. An evaluation of seven AudioLLMs shows that short clips are easier to transcribe than long-form audio and that ASR quality strongly influences summarization accuracy. Open-source Whisper-v3 often outperforms closed models on ASR, while long-form audio remains a major challenge. The study also analyzes prompt robustness and error categories, emphasizing the need for domain adaptation. It offers actionable directions for improving numerical recognition, financial terminology handling, and context length in AudioLLMs, and the authors plan to release all datasets and code to accelerate progress in finance-focused audio intelligence.

Abstract

Audio Large Language Models (AudioLLMs) have received widespread attention and have significantly improved performance on audio tasks such as conversation, audio understanding, and automatic speech recognition (ASR). Despite these advances, no benchmark exists for assessing AudioLLMs in financial scenarios, where audio data such as earnings conference calls and CEO speeches are crucial resources for financial analysis and investment decisions. In this paper, we introduce FinAudio, the first benchmark designed to evaluate the capacity of AudioLLMs in the financial domain. We first define three tasks based on the unique characteristics of the financial domain: 1) ASR for short financial audio, 2) ASR for long financial audio, and 3) summarization of long financial audio. Next, we curate two short-audio and two long-audio datasets and develop a novel dataset for financial audio summarization, which together make up the FinAudio benchmark. Finally, we evaluate seven prevalent AudioLLMs on FinAudio. Our evaluation reveals the limitations of existing AudioLLMs in the financial domain and offers insights for improving them. All datasets and code will be released.

Paper Structure

This paper contains 24 sections, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Evaluation pipelines for the three FinAudio tasks.
  • Figure 2: Word error rate (WER) of each model across datasets. Whisper-v3 consistently achieved the lowest WER.
  • Figure 3: An overview of critical financial audio data types and applications.
  • Figure 4: Prompt robustness analysis: comparison of WER between fixed-prompt and random-prompt trials.
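
The WER reported in Figures 2 and 4 is, by its standard definition, the word-level edit distance between a reference transcript and a model's transcript, divided by the number of reference words. The sketch below illustrates that definition only. The function name, the lowercase-and-whitespace-split normalization, and the example sentences are assumptions made for illustration; they are not taken from the FinAudio evaluation code, which may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Assumption: text is lowercased and split on whitespace. A real evaluation
    pipeline may also strip punctuation or normalize numbers.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical example: one misheard number in a six-word reference.
    reference = "revenue grew fifteen percent this quarter"
    hypothesis = "revenue grew fifty percent this quarter"
    print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

In the hypothetical example, a single misrecognized number ("fifteen" heard as "fifty") is one substitution out of six reference words. This is the kind of numerical error the TL;DR singles out as a direction for improvement: small in WER terms, but it changes the financial meaning of the sentence.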