Lightweight Operations for Visual Speech Recognition
Iason Ioannis Panagos, Giorgos Sfikas, Christophoros Nikou
TL;DR
This work tackles the high computational cost of visual speech recognition (VSR) by introducing lightweight architectures that use Ghost modules for both visual feature extraction and sequence modeling, together with a novel Partial Temporal Block that further reduces the cost of temporal processing. Evaluated on the LRW dataset, the approach recognizes words with minimal accuracy loss while substantially reducing parameters and FLOPs, making on-device VSR more feasible. Through extensive ablations, the authors show how channel-split ratios, kernel sizes, and block designs trade off accuracy against efficiency, providing practical guidance for resource-constrained deployments. The authors state that code and trained models will be made publicly available, which should aid reproducibility and support VSR applications in noisy or audio-less environments.
Abstract
Visual speech recognition (VSR), which decodes spoken words from video data, offers significant benefits, particularly when audio is unavailable. However, the high dimensionality of video data leads to prohibitive computational costs that demand powerful hardware, limiting VSR deployment on resource-constrained devices. This work addresses this limitation by developing lightweight VSR architectures. Leveraging efficient operation design paradigms, we create compact yet powerful models with reduced resource requirements and minimal accuracy loss. We train and evaluate our models on a large-scale public dataset for recognition of words from video sequences, demonstrating their effectiveness for practical applications. We also conduct an extensive array of ablative experiments to thoroughly analyze the size and complexity of each model. Code and trained models will be made publicly available.
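To give a rough sense of why Ghost modules reduce model size, the sketch below compares the weight count of a standard convolution with that of a Ghost module. It assumes the commonly used GhostNet-style formulation, in which a primary convolution produces a fraction 1/s of the output channels and cheap depthwise operations with kernel size d generate the rest. The function names, the ratio s, the kernel size d, and the example channel sizes are illustrative assumptions and are not taken from this paper; biases and normalization layers are ignored.

```python
# Illustrative weight counts: standard convolution vs. a GhostNet-style Ghost module.
# Assumptions (not from this paper): ratio s, depthwise "cheap" kernel size d,
# no biases, no normalization layers.

def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights of a dense k x k convolution mapping c_in to c_out channels."""
    return c_in * c_out * k * k


def ghost_module_params(c_in: int, c_out: int, k: int, s: int = 2, d: int = 3) -> int:
    """Weights of a Ghost module under the assumptions stated above."""
    if c_out % s != 0:
        raise ValueError("c_out must be divisible by the ratio s in this sketch")
    intrinsic = c_out // s                    # channels from the primary convolution
    primary = c_in * intrinsic * k * k        # dense convolution, fewer output channels
    cheap = intrinsic * (s - 1) * d * d       # depthwise ops producing the remaining channels
    return primary + cheap


if __name__ == "__main__":
    dense = standard_conv_params(64, 128, 3)  # 73728
    ghost = ghost_module_params(64, 128, 3)   # 37440
    print(f"standard: {dense}, ghost: {ghost}, reduction: {dense / ghost:.2f}x")
```

With s = 2 the weight count is roughly halved, because the depthwise operations add only a small number of weights compared with the dense convolution they replace. Larger ratios reduce the count further, typically at some cost in accuracy.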
