TD3Net: A temporal densely connected multi-dilated convolutional network for lipreading

Byung Hoon Lee, Wooseok Shin, Sung Won Han

TL;DR

TD3Net addresses the challenge of modeling continuous lip movements by fusing dense skip connections with multi-dilated temporal convolutions, extending the receptive field without blind spots. It adapts the D3Net concept to the time domain through TD2 blocks and densely connects TD2 blocks into TD3 blocks, balancing expressiveness with training stability. Empirical results on LRW and LRW-1000 show competitive word-level lipreading performance with fewer parameters and lower FLOPs, and visualization indicates richer multi-scale temporal features while preserving continuity. The approach also emphasizes real-time applicability, suggesting TD3Net as a lightweight backend for AVSR and streaming lipreading systems.

Abstract

The word-level lipreading approach typically employs a two-stage framework with separate frontend and backend architectures to model dynamic lip movements. Each component has been extensively studied, and in the backend architecture, temporal convolutional networks (TCNs) have been widely adopted in state-of-the-art methods. Recently, dense skip connections have been introduced in TCNs to mitigate the limited density of the receptive field, thereby improving the modeling of complex temporal representations. However, their performance remains constrained owing to potential information loss regarding the continuous nature of lip movements, caused by blind spots in the receptive field. To address this limitation, we propose TD3Net, a temporal densely connected multi-dilated convolutional network that combines dense skip connections and multi-dilated temporal convolutions as the backend architecture. TD3Net covers a wide and dense receptive field without blind spots by applying different dilation factors to skip-connected features. Experimental results on a word-level lipreading task using two large publicly available datasets, Lip Reading in the Wild (LRW) and LRW-1000, indicate that the proposed method achieves performance comparable to state-of-the-art methods. It achieved higher accuracy with fewer parameters and lower floating-point operations compared to existing TCN-based backend architectures. Moreover, visualization results suggest that our approach effectively utilizes diverse temporal features while preserving temporal continuity, presenting notable advantages in lipreading systems. The code is available at our GitHub repository (https://github.com/Leebh-kor/TD3Net).
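The blind-spot argument in the abstract can be checked with a toy receptive-field calculation. The sketch below is not the authors' implementation; it assumes a dense block of 1D convolutions with kernel size 3 and stride 1, a fixed dilation of 2^l at layer l for the baseline, and, for the multi-dilated variant, a dilation of 2^k applied to the k-th skip-connected feature (the convention used in D3Net). The function names `shift_union`, `is_contiguous`, and `path_fields` are hypothetical and introduced only for illustration.

```python
def shift_union(rf, dilation, kernel_size=3):
    """Time offsets covered after a dilated conv reads a feature whose field is `rf`."""
    half = kernel_size // 2
    return {t + k * dilation for t in rf for k in range(-half, half + 1)}


def is_contiguous(rf):
    """True if the set of time offsets has no gaps (no blind spots)."""
    return rf == set(range(min(rf), max(rf) + 1))


def path_fields(num_layers, multi_dilated):
    """Per-path receptive fields seen by the last layer of a toy dense block.

    Assumption: the fixed variant uses dilation 2**layer for every path,
    while the multi-dilated variant uses 2**k for the k-th skip path.
    """
    features = [{0}]  # receptive field of the block input
    per_path = []
    for layer in range(num_layers):
        per_path = []
        for k, rf in enumerate(features):
            d = 2 ** k if multi_dilated else 2 ** layer
            per_path.append(shift_union(rf, d))
        # The layer output sums the contributions of all skip paths.
        features.append(set().union(*per_path))
    return per_path


fixed = path_fields(3, multi_dilated=False)
multi = path_fields(3, multi_dilated=True)
for name, paths in (("fixed", fixed), ("multi-dilated", multi)):
    for k, rf in enumerate(paths):
        print(name, "path", k, sorted(rf), "contiguous:", is_contiguous(rf))
```

Under these assumptions, the third layer of the fixed-dilation block reads the two earliest skip paths with gaps between the sampled positions, while every path in the multi-dilated variant covers a gap-free span. Both variants reach the same overall span, so the difference lies in coverage along each path rather than in total receptive-field width.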

Paper Structure

This paper contains 30 sections, 2 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Visualization of the receptive fields for (a) standard dense block and (b) dilated dense block. Each layer is depicted as a square box, with the dilation factor marked as d. The corresponding receptive fields are presented inside the box. For simplicity, the dense blocks comprise a 1D convolutional layer with a filter size of 3 and a stride of 1.
  • Figure 2: Overall framework of word-level lipreading, including the proposed method.
  • Figure 3: Comparison of blind spots in the receptive field of the green-highlighted output activation in the third layer of the dense block across two TC layer configurations. In (a), using a fixed dilation factor leads to blind spots when the sampling interval exceeds the receptive fields of activations in the red and blue skip-connected feature maps. By contrast, (b) illustrates that the multi-dilated TC layer adaptively sets the dilation factor for each skip-connected path based on its receptive field, thereby preventing blind spots and ensuring temporal continuity.
  • Figure 4: Visualization of the nested structure of the TD3 block that densely connects the TD2 blocks. Each 1D convolution with a kernel size of 1 reduces the number of channels generated by the nested dense skip connections.
  • Figure 5: Visualization of the preprocessing steps performed on the LRW dataset, including the image shapes (channels, height, width) at each stage.
  • ...and 1 more figure