
MENASpeechBank: A Reference Voice Bank with Persona-Conditioned Multi-Turn Conversations for AudioLLMs

Zien Sheikh Ali, Hunzalah Hassan Bhatti, Rabindra Nath Nandi, Shammur Absar Chowdhury, Firoj Alam

TL;DR

MENASpeechBank tackles the data bottleneck for AudioLLMs by delivering a bilingual MENA reference voice bank and a controllable persona-to-dialogue-to-speech pipeline. The approach builds 469 grounded personas using World Values Survey-inspired attributes, expands domain coverage to 4,521 scenarios, and generates approximately 2.1 million audio–text instruction pairs through persona-driven GPT-4.1 dialogue and speaker-conditioned synthesis, enabling end-to-end AudioLLM adaptation. Evaluation with an LLM-as-a-judge across synthetic and human speech shows strong rubric-level performance, with audio-native models (Gemini-2.5 Pro) delivering the best overall results and fine-tuning providing notable gains. The work enables controlled, dialect- and persona-aware evaluation and data generation for AudioLLMs, and the authors plan to publicly release both MENASpeechBank and the synthetic conversations to accelerate research and practical deployment in multilingual, voice-based assistants.

Abstract

Audio large language models (AudioLLMs) enable instruction-following over speech and general audio, but progress is increasingly limited by the lack of diverse, conversational, instruction-aligned speech-text data. This bottleneck is especially acute for persona-grounded interactions and dialectal coverage, where collecting and releasing real multi-speaker recordings is costly and slow. We introduce MENASpeechBank, a reference speech bank comprising about 18K high-quality utterances from 124 speakers spanning multiple MENA countries, covering English, Modern Standard Arabic (MSA), and regional Arabic varieties. Building on this resource, we develop a controllable synthetic data pipeline that: (i) constructs persona profiles enriched with World Values Survey-inspired attributes, (ii) defines a taxonomy of about 5K conversational scenarios, (iii) matches personas to scenarios via semantic similarity, (iv) generates about 417K role-play conversations with an LLM where the user speaks as the persona and the assistant behaves as a helpful agent, and (v) synthesizes the user turns by conditioning on reference speaker audio to preserve speaker identity and diversity. We evaluate both synthetic and human-recorded conversations and provide detailed analysis. We will release MENASpeechBank and the generated conversations publicly for the community.
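
Step (iii) of the pipeline, matching personas to scenarios via semantic similarity, can be illustrated with a minimal sketch. The abstract does not specify the embedding model or the ranking rule, so everything here is an assumption: a bag-of-words count vector stands in for a real sentence embedding, cosine similarity is assumed as the similarity measure, and the function names (embed, cosine, match_scenarios), the persona text, and the scenario descriptions are all hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: a bag-of-words count vector.
    (Assumption: a real pipeline would use a learned embedding model.)"""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_scenarios(persona: str, scenarios: list[str], top_k: int = 2) -> list[tuple[str, float]]:
    """Return the top_k scenarios ranked by similarity to the persona text."""
    p = embed(persona)
    scored = [(s, cosine(p, embed(s))) for s in scenarios]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical persona summary and scenario descriptions (not from the paper).
persona = "university student in Cairo interested in travel and cooking"
scenarios = [
    "booking travel tickets for a student trip",
    "asking for cooking advice and recipes",
    "filing a corporate tax return",
]
ranked = match_scenarios(persona, scenarios)
for scenario, score in ranked:
    print(f"{score:.3f}  {scenario}")
```

In this toy setup the two scenarios that share vocabulary with the persona are ranked above the unrelated one. Swapping embed for a dense sentence encoder would leave the ranking logic unchanged.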

Paper Structure

This paper contains 33 sections, 3 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: An example of a persona, a conversational scenario, and conversational turns. Persona attributes are derived from basic demographic information, country-specific WVS values, and heuristics.
  • Figure 2: An overview of the MENASpeechBank development pipeline.
  • Figure 3: Distribution of conversation domains and their respective subcategories.
  • Figure 4: System and user prompts used for persona-to-summary generation.
  • Figure 5: Root-left taxonomy with two highlighted branches: task/service domains (blue) and knowledge/topic domains (green).
  • ...and 2 more figures