Making Acoustic Side-Channel Attacks on Noisy Keyboards Viable with LLM-Assisted Spectrograms' "Typo" Correction

Seyyed Ali Ayati, Jin Hyun Park, Yichen Cai, Marcus Botacin

TL;DR

This paper tackles the fragility of acoustic side-channel attack (ASCA) keystroke inference under realistic noise. It introduces a transformer-based framework that combines Vision Transformers (VTs) for long-range contextual keystroke classification with Large Language Models (LLMs) for context-aware typo correction, plus a lightweight LoRA-fine-tuned LLM to enable on-device use. The VT-based models achieve state-of-the-art accuracy in clean conditions, while LLMs substantially mitigate noise-induced errors, with GPT-4o delivering the best overall performance and the LoRA-tuned Llama-3.2-3B providing near-parity at a fraction of the size. The work demonstrates the practical viability of ASCAs in noisy settings and contributes open-source models and pipelines to foster further research, while highlighting security implications that motivate more robust authentication measures.

Abstract

The widespread integration of microphones into devices increases the opportunities for Acoustic Side-Channel Attacks (ASCAs), as these microphones can capture keystroke audio signals that might reveal sensitive information. However, the current State-Of-The-Art (SOTA) models for ASCAs, including Convolutional Neural Networks (CNNs) and hybrid models such as CoAtNet, still exhibit limited robustness under realistic noisy conditions. Solving this problem requires either: (i) increased model capacity to infer contextual information from longer sequences, allowing the model to learn that a word first captured under noise is the same as a later, cleanly captured occurrence of that word, or (ii) an approach that fixes misidentified information using context, since people do not type random words but the ones that best fit the conversation. In this paper, we demonstrate that both strategies are viable and complementary solutions for making ASCAs practical. We observed that no existing solution leverages the power of advanced transformer architectures for these tasks and propose that: (i) Visual Transformers (VTs) are the candidate solutions for capturing long-term contextual information and (ii) transformer-powered Large Language Models (LLMs) are the candidate solutions for fixing the "typos" (mispredictions) the model might make. Thus, we present a first-of-its-kind approach that integrates VTs and LLMs for ASCAs. First, we show that VTs achieve SOTA performance in classifying keystrokes when compared to the previous CNN benchmark. Second, we demonstrate that LLMs can mitigate the impact of real-world noise. Evaluations on natural sentences revealed that: (i) incorporating LLMs (e.g., GPT-4o) in our ASCA pipeline boosts the performance of error-correction tasks; and (ii) comparable performance can be attained by a lightweight, fine-tuned smaller LLM (67 times smaller than GPT-4o), using...

Paper Structure

This paper contains 25 sections, 3 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Example of Error Detection and Correction Using LLMs: The initial text sequence (top) represents the ideal output. The noisy prediction (middle) introduces typographical and semantic errors due to environmental noise or model inaccuracies. The corrected output (bottom) demonstrates how LLMs refine the sequence using contextual understanding, substituting errors (e.g., attwnded to attended).
  • Figure 2: Comparison of Mel spectrograms for the keystroke corresponding to digit "0" from the phone dataset in a clean (left) and noisy (right) scenario. The noisy spectrogram exhibits additional artifacts that distort and attenuate the keystroke's spectral features, reducing the signal energy and making classification significantly more challenging.
  • Figure 3: Audio pre-processing pipeline for classification and LLM-based typo correction.
  • Figure 4: Pipeline for Detecting and Correcting Errors in Keystroke Predictions Using LLMs: From noisy audio waveforms to mel-spectrogram processing, keystroke classification via CoAtNet, and error detection/correction with LLMs, resulting in an accurate and more probable textual output.
  • Figure 5: Performance on BLEU, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L for different models, including the fine-tuned Llama-3.2-3B model, at varying noise factors on the Phone and Zoom datasets. For clarity, only the mean is displayed in this graph; the standard deviation is omitted.
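
The correction and scoring steps described in Figures 1 and 5 can be illustrated with a toy sketch. This is not the paper's method: the paper uses LLMs for context-aware correction, whereas the sketch below swaps in a plain closest-word lookup from Python's standard `difflib` module, and a simplified unigram-overlap F1 in the spirit of ROUGE-1. The vocabulary, the example sentence, and the function names `correct_word`, `correct_sentence`, and `unigram_f1` are all hypothetical; only the `attwnded` to `attended` substitution comes from the Figure 1 caption.

```python
import difflib
from collections import Counter

# Hypothetical vocabulary and sentence, chosen only for illustration.
# An LLM would use sentence context instead of a fixed word list.
VOCAB = ["she", "attended", "the", "meeting", "yesterday"]
REFERENCE = "she attended the meeting yesterday"
NOISY = "she attwnded the meetibg yesterday"  # simulated classifier mispredictions


def correct_word(word, vocab=VOCAB, cutoff=0.6):
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    if word in vocab:
        return word
    matches = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_sentence(sentence):
    """Correct each whitespace-separated token independently (no context used)."""
    return " ".join(correct_word(w) for w in sentence.split())


def unigram_f1(reference, candidate):
    """Unigram-overlap F1 between two sentences (simplified: no stemming)."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    corrected = correct_sentence(NOISY)
    print("noisy:    ", NOISY, unigram_f1(REFERENCE, NOISY))
    print("corrected:", corrected, unigram_f1(REFERENCE, corrected))
```

A word-list lookup like this only handles misspellings that stay close to a known word. It cannot choose between several plausible words, which is the kind of ambiguity that context-aware correction is meant to resolve.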