NAC-TCN: Temporal Convolutional Networks with Causal Dilated Neighborhood Attention for Emotion Understanding

Alexander Mehta; William Yang

TL;DR

NAC-TCN addresses the challenge of modeling temporal emotion cues in videos by fusing Dilated Neighborhood Attention with Temporal Convolutional Networks under strict causality. The method introduces a causal Dilated Neighborhood Attention mechanism within Temporal Blocks and uses residual connections and 1x1 projections to keep the parameter count small while maintaining performance. On AffWild2, EmoReact, and AFEW-VA, NAC-TCN achieves competitive or state-of-the-art results with fewer parameters and lower MACs than baselines such as GRUs, LSTMs, TCAN, and attention-based models. The work demonstrates that combining local convolutions with targeted attention yields efficient temporal representations for emotion understanding, and the code is public to support reproducibility and broader application.
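To make the "strict causality" and residual-connection ideas concrete, here is a minimal pure-Python sketch of a causal dilated 1D convolution on a single-channel sequence. It is not the authors' code: the function names `causal_dilated_conv1d` and `residual_block` are hypothetical, implicit zero padding on the left is an assumption, and the residual block is simplified to the input plus a ReLU of the convolution output.

```python
def causal_dilated_conv1d(x, weights, dilation):
    """Compute y[t] = sum_j weights[j] * x[t - j * dilation].

    Positions before the start of the sequence are treated as zero
    (an assumed left zero padding), so y[t] never depends on x[s] for s > t.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:  # skip taps that fall before the sequence start
                acc += w * x[idx]
        out.append(acc)
    return out


def residual_block(x, weights, dilation):
    """Simplified residual connection: x + ReLU(causal dilated conv of x)."""
    conv = causal_dilated_conv1d(x, weights, dilation)
    return [xi + max(0.0, ci) for xi, ci in zip(x, conv)]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0]
    # With taps [1, 1] and dilation 2, each output is x[t] + x[t - 2].
    print(causal_dilated_conv1d(signal, [1.0, 1.0], dilation=2))
```

Because each output reads only the current and earlier time steps, changing a future frame leaves all earlier outputs unchanged, which is the property a causal temporal model needs for online prediction.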

Abstract

In emotion recognition from videos, a key improvement has been to model emotions over time rather than from a single frame. Many architectures address this task, including GRUs, LSTMs, Self-Attention, Transformers, and Temporal Convolutional Networks (TCNs). However, these methods suffer from high memory usage, a large number of operations, or poor gradients. We propose Neighborhood Attention with Convolutions TCN (NAC-TCN), which incorporates the benefits of attention and Temporal Convolutional Networks while ensuring that causal relationships are understood, resulting in reduced computation and memory cost. We accomplish this by introducing a causal version of Dilated Neighborhood Attention and combining it with convolutions. On standard emotion recognition datasets, our model achieves comparable, better, or state-of-the-art performance relative to TCNs, TCAN, LSTMs, and GRUs while requiring fewer parameters. We publish our code online for easy reproducibility and use in other projects.
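As an illustration of what a causal version of Dilated Neighborhood Attention might look like, the following pure-Python sketch lets each time step attend only to itself and to earlier steps spaced by the dilation. This is one plausible reading, not the paper's formulation: the names `causal_neighborhood` and `causal_dilated_na` are hypothetical, the neighborhood is assumed to be truncated at the start of the sequence, and multiple heads, learned projections, and positional biases are omitted.

```python
import math


def causal_neighborhood(t, kernel_size, dilation):
    """Indices t - (k-1)d, ..., t - d, t that are >= 0, oldest first.

    Assumption: the window is truncated at the sequence start.
    """
    return [t - j * dilation
            for j in range(kernel_size - 1, -1, -1)
            if t - j * dilation >= 0]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_dilated_na(queries, keys, values, kernel_size, dilation):
    """Single-head scaled dot-product attention over causal dilated neighbors.

    queries, keys, values: lists of equal length, one vector per time step.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for t, q in enumerate(queries):
        nbrs = causal_neighborhood(t, kernel_size, dilation)
        scores = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in nbrs]
        weights = softmax(scores)
        # Weighted sum of the neighbors' value vectors, channel by channel.
        outputs.append([sum(w * values[i][c] for w, i in zip(weights, nbrs))
                        for c in range(len(values[0]))])
    return outputs


if __name__ == "__main__":
    print(causal_neighborhood(5, kernel_size=3, dilation=2))  # neighbors of step 5
```

Each step attends to at most `kernel_size` positions rather than the whole sequence, so the cost per step stays constant in the sequence length, which matches the abstract's motivation of lower computation and memory than full self-attention.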

Paper Structure

This paper contains 34 sections, 10 equations, 3 figures, and 5 tables.

Introduction
Related Works
Recurrent Networks
Attention and Transformer
Background
Temporal Convolutional Networks
Dilated Convolution
Dilated Temporal Convolutional Network
Neighborhood Attention
Dilated Neighborhood Attention (DiNA)
TCAN
Methods
NAC-TCN Formulation
Temporal Block
Motivation for Convolution and Neighborhood Attention Stacking
...and 19 more sections

Figures (3)

Figure 1: The NAC-TCN combines Dilated Temporal Convolutions with Dilated Neighborhood Attention to better capture temporal relationships in video inputs through contextual weighting using Dilated Neighborhood Attention. Our proposed architecture achieves better performance with smaller model size.
Figure 2: TCN architecture
