MENASpeechBank: A Reference Voice Bank with Persona-Conditioned Multi-Turn Conversations for AudioLLMs

Zien Sheikh Ali; Hunzalah Hassan Bhatti; Rabindra Nath Nandi; Shammur Absar Chowdhury; Firoj Alam

TL;DR

MENASpeechBank tackles the data bottleneck for AudioLLMs with a bilingual MENA reference voice bank and a controllable persona-to-dialogue-to-speech pipeline. The approach builds 469 grounded personas using World Values Survey-inspired attributes, expands domain coverage to 4,521 scenarios, and generates approximately 2.1 million audio–text instruction pairs through persona-driven GPT-4.1 dialogue and speaker-conditioned synthesis, enabling end-to-end AudioLLM adaptation. An LLM-as-a-judge evaluation on both synthetic and human speech shows strong rubric-level performance: audio-native models (Gemini-2.5 Pro) deliver the best overall results, and fine-tuning provides notable gains. The work supports controlled, dialect- and persona-aware evaluation and data generation for AudioLLMs, and the authors plan to publicly release both MENASpeechBank and the synthetic conversations to accelerate research and practical deployment of multilingual, voice-based assistants.

Abstract

Audio large language models (AudioLLMs) enable instruction-following over speech and general audio, but progress is increasingly limited by the lack of diverse, conversational, instruction-aligned speech-text data. This bottleneck is especially acute for persona-grounded interactions and dialectal coverage, where collecting and releasing real multi-speaker recordings is costly and slow. We introduce MENASpeechBank, a reference speech bank comprising about 18K high-quality utterances from 124 speakers spanning multiple MENA countries, covering English, Modern Standard Arabic (MSA), and regional Arabic varieties. Building on this resource, we develop a controllable synthetic data pipeline that: (i) constructs persona profiles enriched with World Values Survey-inspired attributes, (ii) defines a taxonomy of about 5K conversational scenarios, (iii) matches personas to scenarios via semantic similarity, (iv) generates about 417K role-play conversations with an LLM where the user speaks as the persona and the assistant behaves as a helpful agent, and (v) synthesizes the user turns by conditioning on reference speaker audio to preserve speaker identity and diversity. We evaluate both synthetic and human-recorded conversations and provide detailed analysis. We will release MENASpeechBank and the generated conversations publicly for the community.
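Step (iii) above, matching personas to scenarios via semantic similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which embedding model or selection rule is used. The sketch assumes a toy bag-of-words vector in place of a real sentence embedding, and the names `embed`, `cosine`, and `match_scenarios`, the `top_k` parameter, and the example persona and scenarios are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding (assumption): word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_scenarios(persona_summary, scenarios, top_k=2):
    """Return the top_k (score, scenario) pairs most similar to the persona."""
    persona_vec = embed(persona_summary)
    scored = [(cosine(persona_vec, embed(s)), s) for s in scenarios]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical persona summary and scenario descriptions, for illustration only.
persona = "nurse in Cairo who enjoys cooking and travel planning"
scenarios = [
    "booking a flight and travel planning for a family trip",
    "asking for cooking tips and a recipe",
    "troubleshooting a home router",
]
top = match_scenarios(persona, scenarios, top_k=2)
for score, scenario in top:
    print(f"{score:.3f}  {scenario}")
```

In a real pipeline the count vectors would be replaced by dense embeddings from a sentence encoder, but the ranking logic (score every scenario against the persona, keep the best few) would stay the same under these assumptions.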

Paper Structure (33 sections, 3 equations, 7 figures, 10 tables)

Introduction
Related Work
AudioLLMs
Synthetic Instruction Data
Synthetic Speech and Training Data
MENASpeechBank
Reference Audio Collection
Persona Generation
Taxonomy, Scenario and Conversations
Conversational Speech Generation
Dataset Statistics
Reference speakers.
Speakers Statistics
Reference speech samples.
Domain and subcategory wise distribution.
...and 18 more sections

Figures (7)

Figure 1: An example of a persona, a conversational scenario, and conversational turns. Persona attributes are derived from basic demographic information, country-specific WVS values, and heuristics.
Figure 2: An overview of the MENASpeechBank development pipeline.
Figure 3: Distribution of conversation domains and their respective subcategories.
Figure 4: System and user prompts used for persona-to-summary generation.
Figure 5: Root-left taxonomy with two highlighted branches: task/service domains (blue) and knowledge/topic domains (green).
...and 2 more figures
