Lightweight Operations for Visual Speech Recognition

Iason Ioannis Panagos, Giorgos Sfikas, Christophoros Nikou

TL;DR

This work tackles the high computational burden of visual speech recognition by introducing lightweight architectures that use Ghost modules for both visual feature extraction and sequence modeling, together with a novel Partial Temporal Block that further reduces temporal processing costs. Evaluated on the LRW dataset, the approach recognizes words with minimal accuracy loss while substantially cutting parameters and FLOPs, making on-device VSR more feasible. Through extensive ablations, the authors map out how channel-split ratios, kernel sizes, and block designs trade off accuracy and efficiency, providing practical guidance for resource-constrained deployments. The authors state that code and trained models will be made publicly available, which should aid reproducibility and facilitate real-world VSR applications in noisy or audio-less environments.

Abstract

Visual speech recognition (VSR), which decodes spoken words from video data, offers significant benefits, particularly when audio is unavailable. However, the high dimensionality of video data leads to prohibitive computational costs that demand powerful hardware, limiting VSR deployment on resource-constrained devices. This work addresses this limitation by developing lightweight VSR architectures. Leveraging efficient operation design paradigms, we create compact yet powerful models with reduced resource requirements and minimal accuracy loss. We train and evaluate our models on a large-scale public dataset for recognition of words from video sequences, demonstrating their effectiveness for practical applications. We also conduct an extensive array of ablative experiments to thoroughly analyze the size and complexity of each model. Code and trained models will be made publicly available.

Paper Structure

This paper contains 15 sections, 3 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Overview of the architecture used for visual speech recognition. We experiment with several feature extractors as well as our proposed lightweight sequence models. The Softmax function is used as the classifier. The overall system outputs a spoken word.
  • Figure 2: Ghost modules. BN indicates the Batch Normalization operation, ReLU indicates the Rectified Linear Unit function, DW refers to the depth-wise convolution, $\sigma$ is the logistic sigmoid, and $\odot$ denotes element-wise multiplication. (a) Original Ghost Module han2020ghostnet. (b) DFC attention tang2022ghostnetv2. (c) Ghost Module with DFC attention.
  • Figure 3: Block designs used in the proposed Partial Temporal Block. (a) The block architecture. "C" represents the number of channels of the input volume to each component. (b) ShuffleNet ma2018shufflenet block architecture. (c) FasterNet chen2023run block components. "DW" and "PW" indicate depth-wise and point-wise convolutions, respectively. "BN" is the Batch Normalization layer and "Act" can be any activation function (e.g., ReLU).
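
As a rough illustration of why the Ghost modules of Figure 2 and the partial convolutions of Figure 3 reduce model size, the sketch below counts convolution weights for each design. It follows the general formulations from the GhostNet and FasterNet papers, not the exact configuration used in this work: the function names, channel counts, kernel sizes, the Ghost ratio of 2, and the 1/4 channel split are all assumptions chosen for illustration, and biases and normalization layers are ignored.

```python
def conv_params(c_in, c_out, kernel, groups=1):
    """Weight count of a bias-free 2D convolution with a square kernel."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kernel * kernel


def ghost_params(c_in, c_out, kernel=3, cheap_kernel=3, ratio=2):
    """Weight count of a Ghost module (general GhostNet formulation).

    A primary convolution produces c_out / ratio "intrinsic" channels;
    a cheap depth-wise convolution then generates the remaining
    "ghost" channels from those intrinsic ones.
    """
    assert c_out % ratio == 0
    intrinsic = c_out // ratio
    ghost = c_out - intrinsic
    primary = conv_params(c_in, intrinsic, kernel)
    # Depth-wise: one group per intrinsic channel.
    cheap = conv_params(intrinsic, ghost, cheap_kernel, groups=intrinsic)
    return primary + cheap


def partial_conv_params(channels, kernel=3, n_div=4):
    """Weight count of a FasterNet-style partial convolution.

    Only channels / n_div channels are convolved; the rest are passed
    through unchanged and therefore add no weights.
    """
    assert channels % n_div == 0
    c_part = channels // n_div
    return conv_params(c_part, c_part, kernel)


# Hypothetical layer sizes, used only to compare the designs.
standard = conv_params(64, 128, kernel=3)
ghost = ghost_params(64, 128, kernel=3, cheap_kernel=3, ratio=2)
full = conv_params(128, 128, kernel=3)
partial = partial_conv_params(128, kernel=3, n_div=4)

print(f"standard conv: {standard} weights, Ghost module: {ghost} weights")
print(f"full conv: {full} weights, partial conv: {partial} weights")
```

With these illustrative settings the Ghost module needs roughly half the weights of the standard convolution, and the partial convolution needs one sixteenth of the weights of a full convolution over the same channels. The actual savings reported in the paper depend on its own channel-split ratios and kernel sizes, which are examined in the ablations.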