TRAMBA: A Hybrid Transformer and Mamba Architecture for Practical Audio and Bone Conduction Speech Super Resolution and Enhancement on Mobile and Wearable Platforms

Yueyuan Sui; Minghui Zhao; Junxi Xia; Xiaofan Jiang; Stephen Xia

TL;DR

TRAMBA is a hybrid transformer and Mamba architecture for practical speech super resolution and enhancement on mobile and wearable platforms, with a focus on vibration-based sensing using bone conduction microphones (BCM) and accelerometers (ACCEL). By pre-training on abundant over-the-air (OTA) speech data and fine-tuning with a small amount of user data, TRAMBA achieves state-of-the-art scores on perceptual metrics (PESQ and STOI) with a memory footprint under 20 MB and inference up to 465 times faster than GAN-based rivals. The approach performs robustly across sensor placements, environments, and sampling rates, while enabling substantial power savings and real-time operation on smartphones and head-worn devices. These results point to a viable path for deploying vibration-based speech enhancement in consumer wearables, with practical benefits for battery life and for speech quality in noisy conditions.

Abstract

We propose TRAMBA, a hybrid transformer and Mamba architecture for acoustic and bone conduction speech enhancement, suitable for mobile and wearable platforms. Bone conduction speech enhancement has been impractical to adopt in mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; (ii) there exists a performance gap between state-of-the-art models with memory footprints of hundreds of MBs and methods better suited for resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train TRAMBA with audio speech datasets that are widely available. Then, users fine-tune with a small amount of bone conduction data. TRAMBA outperforms state-of-the-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order of magnitude smaller memory footprint and an inference speedup of up to 465 times. We integrate TRAMBA into real systems and show that TRAMBA (i) improves battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher quality voice in noisy environments than over-the-air speech; (iii) requires a memory footprint of less than 20.0 MB.

Paper Structure (41 sections, 4 equations, 18 figures, 9 tables)

This paper contains 41 sections, 4 equations, 18 figures, 9 tables.

Introduction
Related Works
Audio Super Resolution
Multi-Modal and Vibration-based Speech Enhancement
Audio Super Resolution Architecture Design
Opportunities and Challenges
Heavy Performance and Performance Gap
Data Scarcity
System Design Opportunities
Deep Neural Network Architecture
Preprocessing
Down-Sampling Block
Scale-only Attention-based Feature-wise Linear Modulation (SAFiLM)
Bottleneck
Up-Sampling Block
...and 26 more sections

Figures (18)

Figure 1: TRAMBA enhances vibration-based speech, which is naturally insensitive to ambient noises.
Figure 2: Comparison of performance (PESQ and STOI) vs. efficiency (memory footprint, fine-tuning time, inference time) of TRAMBA and state-of-the-art audio super resolution methods.
Figure 3: Comparison of OTA microphone, BCM, and accelerometer recorded audio under different sampling and filtering schemes. In the low pass filter + decimate scenario, a $100$th-order Butterworth filter with a $2\,\mathrm{kHz}$ cut-off frequency was applied.
Figure 4: Super resolution and enhancement architecture.
Figure 5: SAFiLM architecture.
...and 13 more figures
