SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization

Tan-Hanh Pham, Hoang-Nam Le, Phu-Vinh Nguyen, Chris Ngo, Truong-Son Hy

TL;DR

SilVar addresses the gap in speech-based reasoning for multimodal visual tasks by integrating open-source components (Whisper for speech, CLIP for vision, and LLaMA 3.1-8B for language) into an end-to-end system capable of reasoning from spoken instructions and localizing objects. A two-stage training pipeline (speech-to-text alignment followed by LLM fine-tuning) and a purpose-built speech-reasoning dataset (SilVar-bench) underpin the approach, with evaluation on the MMMU and ScienceQA benchmarks showing competitive performance under speech instructions and strong performance under text instructions. The work also compares speech versus text instructions, analyzes adapter choices for audio-to-LLM transfer, and demonstrates stronger multimodal explanations and bounding-box grounding relative to purely text-based prompts. By releasing code and data, the paper aims to catalyze open, speech-driven multimodal reasoning research toward more accessible and interactive AI systems.

Abstract

Visual Language Models have demonstrated remarkable capabilities across tasks, including visual question answering and image captioning. However, most models rely on text-based instructions, limiting their effectiveness in human-machine interactions. Moreover, the quality of language models depends on reasoning and prompting techniques, such as chain-of-thought (CoT), which remain underexplored when using speech instructions. To address these challenges, we propose SilVar, a novel end-to-end multimodal model that uses speech instructions for reasoning in visual question answering. In addition, we investigate reasoning at three levels of speech instruction: conversational, simple, and complex. SilVar is built upon CLIP, Whisper, and LLaMA 3.1-8B, enabling intuitive interactions by allowing users to provide verbal or text instructions. We also introduce a dataset designed to challenge models with speech-based reasoning tasks for object localization. This dataset enhances the model's ability to process and explain visual scenes from spoken input, moving beyond object recognition to reasoning-based interactions. The experiments show that SilVar achieves state-of-the-art performance on the MMMU and ScienceQA benchmarks despite the challenge of speech-based instructions. We believe SilVar will inspire next-generation multimodal reasoning models, toward expert artificial general intelligence. Our code and dataset are available here.

Paper Structure

This paper contains 13 sections, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: An example from our SilVar-bench dataset, showing spoken reasoning instructions of three types: conversation, simple reasoning, and complex reasoning. The detected objects are highlighted with yellow bounding boxes. The dataset not only focuses on reasoning instructions but also generates visual explanations, enhancing spatial understanding and interpretability.
  • Figure 2: Illustration of the SilVar model architecture, which integrates visual input and audio instructions to generate reasoning text and localize objects. The model comprises four key components: an audio encoder for extracting features from speech, a visual encoder for processing images, a projector for feature transformation, and an LLM that processes information across modalities to generate coherent responses.
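
The four-component layout described in the Figure 2 caption can be sketched roughly as follows. This is only an illustrative toy, not the authors' implementation: the feature dimensions, the use of one linear projector per modality, the random weights standing in for trained ones, and the visual-then-audio ordering are all assumptions made for the example. The encoders (Whisper, CLIP) and the LLM itself are not modeled; the sketch only shows how encoder outputs of different widths might be mapped into a shared embedding space and concatenated into one input sequence.

```python
import random

# All dimensions are illustrative assumptions, not values from the paper.
AUDIO_DIM, VISUAL_DIM, LLM_DIM = 6, 8, 4


def make_projector(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) matrix standing in for a trained projector."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(features, weight):
    """Map each feature vector into the LLM embedding space: y = W x."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weight] for vec in features]


def build_llm_input(audio_feats, visual_feats, audio_proj, visual_proj):
    """Project both modalities and concatenate along the sequence axis.

    Placing visual tokens before audio tokens is an assumption of this sketch.
    """
    return project(visual_feats, visual_proj) + project(audio_feats, audio_proj)


rng = random.Random(0)
# Stand-ins for encoder outputs: 5 audio frames and 3 image patches.
audio_feats = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(5)]
visual_feats = [[rng.random() for _ in range(VISUAL_DIM)] for _ in range(3)]
audio_proj = make_projector(AUDIO_DIM, LLM_DIM, rng)
visual_proj = make_projector(VISUAL_DIM, LLM_DIM, rng)

llm_input = build_llm_input(audio_feats, visual_feats, audio_proj, visual_proj)
print(len(llm_input), len(llm_input[0]))  # sequence length, embedding width
```

In this toy setup the combined sequence has eight vectors (three visual plus five audio), each of width `LLM_DIM`, which is the shape an LLM would need in order to attend over both modalities jointly.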