Multimodal Large Language Models with Fusion Low Rank Adaptation for Device Directed Speech Detection

Shruti Palaskar; Oggi Rudovic; Sameer Dharur; Florian Pesce; Gautam Krishna; Aswin Sivaraman; Jack Berkowitz; Ahmed Hussen Abdelaziz; Saurabh Adya; Ahmed Tewfik

TL;DR

The paper tackles device-directed speech detection (DDSD) by extending a text-only LLM with audio and video cues through Fusion Low Rank Adaptation (FLoRA), a framework that adds lightweight modality-specific adapters while keeping the pretrained backbone frozen. By training small per-modality adapters, FLoRA achieves about a $22\%$ relative reduction in equal error rate (EER) over the text-only baseline and reaches parity with full fine-tuning while updating only $1$–$5\%$ of the parameters. With adapter dropout, the approach is robust to missing modalities, and it scales from $16\mathrm{M}$ to $3\mathrm{B}$ parameters, which makes it practical for both on-device and large-scale multimodal virtual assistant (VA) applications. Overall, FLoRA enables efficient, plug-and-play integration of new modalities into pretrained LLMs, reducing the data and compute required for multimodal DDSD while maintaining strong performance.
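
To give a feel for why low rank adaptation touches so few parameters, the standard LoRA formulation can be written out as below. This is a sketch of generic LoRA, not the paper's own equations; the symbols $W_0$, $A$, $B$, $d$, $k$, $r$ and the example sizes are assumptions chosen for illustration.

$$W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)$$

$$\frac{\text{trainable}}{\text{frozen}} = \frac{r\,(d + k)}{d\,k}, \qquad \text{e.g. } d = k = 1024,\ r = 8:\quad \frac{8 \cdot 2048}{1024^2} = \frac{16384}{1048576} \approx 1.56\%$$

The overall fraction for a full model also depends on which weight matrices receive adapters and how many modalities are added, so the figure above is only indicative.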

Abstract

Although Large Language Models (LLMs) have shown promise for human-like conversations, they are primarily pre-trained on text data. Incorporating audio or video improves performance, but collecting large-scale multimodal data and pre-training multimodal LLMs is challenging. To this end, we propose a Fusion Low Rank Adaptation (FLoRA) technique that efficiently adapts a pre-trained unimodal LLM to consume new, previously unseen modalities via low rank adaptation. For device-directed speech detection, using FLoRA, the multimodal LLM achieves a 22% relative reduction in equal error rate (EER) over the text-only approach and attains performance parity with its full fine-tuning (FFT) counterpart while needing to tune only a fraction of its parameters. Furthermore, with the newly introduced adapter dropout, FLoRA is robust to missing data, improving over FFT with a 20% lower EER and a 56% lower false accept rate. The proposed approach scales well for model sizes from 16M to 3B parameters.
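
Since the headline numbers are stated as relative reductions in EER, a small sketch of how such a metric is commonly computed may help. The code below is an illustration only, not the paper's evaluation code: the function names, the convention that a higher score means "device-directed", and the toy scores are all assumptions.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping a threshold over the observed scores.

    Assumed convention: an utterance is accepted when score >= threshold;
    labels use 1 for device-directed and 0 for not device-directed.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one example of each class")

    best_gap, eer = float("inf"), None
    # Candidate thresholds: every distinct score, plus one above the maximum
    # so that the "reject everything" operating point is also considered.
    thresholds = sorted(set(scores)) + [max(scores) + 1.0]
    for t in thresholds:
        false_accepts = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        false_rejects = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        far = false_accepts / negatives  # false accept rate
        frr = false_rejects / positives  # false reject rate
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is the point where FAR and FRR meet; with finitely many
            # thresholds we report their mean at the closest crossing.
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_reduction(baseline, improved):
    """Relative reduction of an error metric, e.g. 0.22 for a 22% reduction."""
    return (baseline - improved) / baseline


# Toy data, made up for illustration (not from the paper).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_eer = equal_error_rate(toy_scores, toy_labels)
```

A relative reduction compares two such EER values, so `relative_reduction(baseline_eer, new_eer)` returning 0.22 would correspond to the 22% figure quoted above.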

Paper Structure (15 sections, 2 equations, 3 figures, 3 tables)

This paper contains 15 sections, 2 equations, 3 figures, 3 tables.

Introduction
Approach
  Fusion LoRA
  Modality-specific Adapters
  Adapter Dropout
Experimental Setup
  Classification via Generation
  Datasets
  Baselines and Toplines
Results and Discussion
  Fusion via Low Rank Adaptation
  Adapter Dropout
  Scaling Model Size
Conclusion
Acknowledgements

Figures (3)

Figure 1: High-level architecture showing proposed FLoRA adapters. A frozen LLM loads modality-specific adapters to take advantage of all available modalities at training or inference time. The architecture is robust to missing modalities by dropping corresponding adapters (video in the diagram above).
Figure 2: Model architecture showing modality-specific FLoRA layers in an encoder-decoder model. The audio and video paths, drawn with dotted lines, are optional: the corresponding adapters are trained (during training) or invoked (during inference) only if the modality is available.
Figure 3: Model scalability for model sizes from 16M parameters to 3B parameters for the T5 architecture. The FLoRA technique scales across all model sizes, with performance on B2B-3 improving consistently as model size increases, as expected.
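
The architecture described in the captions above (a frozen backbone, one adapter per modality, adapters skipped when a modality is missing) can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `LowRankAdapter` and `FusionLinear`, the choice to sum adapter outputs onto the frozen projection, and the per-adapter dropout rule are all hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LowRankAdapter:
    """Low-rank update delta(x) = B (A x); A is r x d_in, B is d_out x r."""

    def __init__(self, d_in, d_out, rank, rng):
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # Zero-initialised B (the usual LoRA choice): the adapter starts as a no-op.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))

    def num_params(self):
        return sum(len(row) for row in self.A) + sum(len(row) for row in self.B)


class FusionLinear:
    """Frozen linear map W0 plus one hypothetical low-rank adapter per modality."""

    def __init__(self, d_in, d_out, modalities, rank=2, seed=0):
        rng = random.Random(seed)
        # W0 stands in for a pretrained weight; it is never updated in this sketch.
        self.W0 = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
        self.adapters = {m: LowRankAdapter(d_in, d_out, rank, rng) for m in modalities}

    def forward(self, x, available, drop_prob=0.0, rng=None):
        out = matvec(self.W0, x)
        for name in available:
            # Adapter dropout (assumed form): skip each adapter with probability
            # drop_prob. Pass rng only during training; leave it None at inference.
            if rng is not None and rng.random() < drop_prob:
                continue
            delta = self.adapters[name](x)
            out = [o + d for o, d in zip(out, delta)]
        return out


layer = FusionLinear(d_in=4, d_out=3, modalities=["text", "audio", "video"])
x = [1.0, 2.0, 3.0, 4.0]
frozen_only = layer.forward(x, available=[])
# Video is missing here, so its adapter is simply not invoked.
text_audio = layer.forward(x, available=["text", "audio"])
```

Because each adapter is a separate object, a missing modality costs nothing at inference: its name is left out of `available` and the frozen path plus the remaining adapters still produce an output. Randomly skipping adapters during training is one plausible way to expose the model to that situation ahead of time.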
