Safeguarding Privacy in Edge Speech Understanding with Tiny Foundation Models

Afsara Benazir, Felix Xiaozhu Lin

TL;DR

SpeechShield tackles privacy risks in cloud-based ASR by moving entity filtering onto a tiny on-device foundation model. It uses a lightweight on-device NER to label token-level entities, timestamps to mask sensitive spans, and two recovery strategies to merge edge and cloud transcripts while preserving transcription quality. The approach masks about 83% of entities on-device with memory under 100 MB, is 3.3× faster and 17× more compute-efficient than prior privacy frameworks, and maintains competitive WER relative to full-cloud transcription. It also supports downstream tasks such as intent classification, and its deployment is validated on low-resource edge devices. The work demonstrates how tiny FMs can extend privacy protections without sacrificing real-world usability in edge–cloud speech pipelines.

Abstract

Robust speech recognition systems rely on cloud service providers for inference. Such systems must ensure that an untrustworthy provider cannot deduce the sensitive content in speech. Speech content can be sanitized, but sanitization must not compromise transcription accuracy. Recognizing the underutilized capabilities of tiny speech foundation models (FMs), we propose, for the first time, a novel use for them: enhancing speech privacy on resource-constrained devices. We introduce SpeechShield, an edge/cloud privacy-preserving speech inference engine that can filter sensitive entities without compromising transcript accuracy. We use a timestamp-based on-device masking approach in which a token-to-entity prediction model identifies the sensitive entities to filter. Our choice of mask strategically conceals parts of the input and hides sensitive data. The masked input is sent to a trusted cloud service or to a local hub to generate the masked output. The effectiveness of SpeechShield hinges on how well the entity time segments are masked. Our recovery is a confidence-score-based approach that chooses the best prediction between the cloud and on-device models. We implement SpeechShield on a 64-bit Raspberry Pi 4B. Experiments show that our solution leads to robust speech recognition without forsaking privacy. With less than 100 MB of memory, SpeechShield achieves state-of-the-art (SOTA) speech transcription performance while filtering about 83% of private entities directly on-device. SpeechShield is 16x smaller in memory, 3.3x faster, and 17x more compute-efficient than prior privacy-preserving speech frameworks, and it reduces word error rate (WER) by a relative 38.8-77.5% compared to existing offline transcription services.
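The two steps the abstract names, timestamp-based masking and confidence-score-based recovery, can be illustrated with a minimal sketch. This is not the authors' implementation. The `Token` type and the `mask_audio` and `recover` functions are hypothetical names. The sketch also assumes that the mask is silence (zeroed samples), that edge and cloud tokens are already aligned one-to-one, and that the edge token is always kept inside an entity segment; the abstract specifies none of these details.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical token record: text, time span in seconds, confidence."""
    text: str
    start: float
    end: float
    conf: float
    is_entity: bool = False


def mask_audio(samples: List[float], tokens: List[Token],
               sample_rate: int) -> List[float]:
    """Zero out the samples inside every entity time span.

    Assumption: the mask is silence; the paper's actual mask may differ.
    """
    masked = list(samples)
    for tok in tokens:
        if not tok.is_entity:
            continue
        lo = max(0, int(tok.start * sample_rate))
        hi = min(len(masked), int(tok.end * sample_rate))
        for i in range(lo, hi):
            masked[i] = 0.0
    return masked


def recover(edge: List[Token], cloud: List[Token]) -> List[str]:
    """Merge aligned edge and cloud tokens into one transcript.

    Assumed rule: inside an entity segment the cloud only heard the mask,
    so the edge token is kept; elsewhere the higher-confidence token wins.
    """
    merged = []
    for e, c in zip(edge, cloud):
        if e.is_entity:
            merged.append(e.text)
        else:
            merged.append(c.text if c.conf >= e.conf else e.text)
    return merged
```

In this sketch only `mask_audio` output would leave the device, so the entity audio itself is never transmitted; the entity text is restored locally in `recover`.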

Paper Structure

This paper contains 50 sections, 7 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Untrustworthy cloud and risk of privacy. (a) Fully on-device execution fails at ASR inference, while (b) leaks all sensitive information. (c) A prior privacy-preserving speech transcription framework. (d) Ours: a secure timestamp-based speech filter on mobile devices.
  • Figure 2: Overview of SpeechShield. Green text marks successfully recovered words. Bold blue text marks sensitive, private information. Red text marks erroneous predictions.
  • Figure 3: Masking configuration
  • Figure 4: Confidence-score-based transcript recovery. Red boxes mark entity segments. Tokens in the final transcript are chosen based on their confidence scores, given certain conditions.
  • Figure 5: Confusion matrix of token-level masking errors. For the entity 'Intel', a TP (true positive) occurs when that token is masked, an FP (false positive) occurs when a non-entity token is masked instead, and an FN (false negative) is the failure to mask the entity token.
  • ...and 2 more figures
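
The token-level TP, FP, and FN definitions in the Figure 5 caption can be turned into counts as in the sketch below. The function names and the per-token boolean inputs are hypothetical, and reporting precision and recall from these counts is an assumption made for illustration, not a description of the paper's evaluation code.

```python
from typing import Dict, Sequence, Tuple


def masking_counts(is_entity: Sequence[bool],
                   was_masked: Sequence[bool]) -> Dict[str, int]:
    """Count token-level masking outcomes.

    TP: an entity token was masked.
    FP: a non-entity token was masked.
    FN: an entity token was left unmasked.
    """
    counts = {"TP": 0, "FP": 0, "FN": 0}
    for entity, masked in zip(is_entity, was_masked):
        if entity and masked:
            counts["TP"] += 1
        elif masked:
            counts["FP"] += 1
        elif entity:
            counts["FN"] += 1
    return counts


def precision_recall(counts: Dict[str, int]) -> Tuple[float, float]:
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    tp, fp, fn = counts["TP"], counts["FP"], counts["FN"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Recall here corresponds to the fraction of entity tokens that were actually masked, which is the quantity that matters most for privacy.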