An Efficient and Streaming Audio Visual Active Speaker Detection System

Arnav Kundu; Yanzi Jin; Mohammad Sekhavat; Max Horton; Danny Tormoen; Devang Naik

TL;DR

This paper introduces a method to limit the number of future context frames used by the ASD model, which reduces latency. It also proposes a more stringent constraint that limits the total number of past frames the model can access during inference, which tackles the persistent memory issues associated with running streaming ASD systems.

Abstract

This paper addresses the challenging task of Active Speaker Detection (ASD), where the system needs to determine in real time whether a person is speaking in a series of video frames. While previous works have made significant strides in improving network architectures and learning effective representations for ASD, a critical gap exists in the exploration of real-time system deployment. Existing models often suffer from high latency and memory usage, rendering them impractical for immediate applications. To bridge this gap, we present two scenarios that address the key challenges posed by real-time constraints. First, we introduce a method to limit the number of future context frames utilized by the ASD model. By doing so, we alleviate the need for processing the entire sequence of future frames before a decision is made, significantly reducing latency. Second, we propose a more stringent constraint that limits the total number of past frames the model can access during inference. This tackles the persistent memory issues associated with running streaming ASD systems. Beyond these theoretical frameworks, we conduct extensive experiments to validate our approach. Our results demonstrate that constrained transformer models can achieve performance comparable to or even better than state-of-the-art recurrent models, such as uni-directional GRUs, with a significantly reduced number of context frames. Moreover, we shed light on the temporal memory requirements of ASD systems, revealing that larger past context has a more profound impact on accuracy than future context. When profiling on a CPU, we find that our efficient architecture is memory-bound by the amount of past context it can use, and that the compute cost is negligible compared to the memory cost.

Paper Structure (8 sections, 4 equations, 4 figures, 1 table)

This paper contains 8 sections, 4 equations, 4 figures, 1 table.

Introduction
Related Works
Streaming ASD
Experiments
Datasets
Implementation details
Evaluation criteria and metrics
Conclusion

Figures (4)

Figure 1: a: Visual Encoder Block (note the padding change in blue), b: Visual / Audio Encoder, c: Full model architecture at train time
Figure 2: Constrained mask for transformer encoder to limit the future context used by the model for predicting one label.
Figure 3: Constrained mask for transformer encoder to limit the past and future context used by the model for predicting one label.
Figure 4: Latency (future context frames) vs memory (past context frames) trade-off on accuracy (mAP%). In this contour plot, the color of each data point (X,Y) indicates the mAP value corresponding to (memory=X, latency=Y). The mAP remains constant along each level curve, with the slope of these curves in different regions revealing which variable exerts greater influence. In areas where level curves are horizontal, future context primarily affects accuracy. Conversely, vertical level curves signify the dominant impact of past context.
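
The constrained masks described in the captions of Figures 2 and 3 can be illustrated with a small sketch. This is not the authors' implementation: the function name `constrained_attention_mask` and the parameters `past_context` and `future_context` are hypothetical, and the sketch assumes the constraint is a simple band around each frame. Entry `(i, j)` is `True` when the prediction for frame `i` may attend to frame `j`.

```python
import numpy as np


def constrained_attention_mask(num_frames, past_context=None, future_context=0):
    """Build a boolean (num_frames, num_frames) attention mask.

    mask[i, j] is True when frame i is allowed to attend to frame j.
    future_context: max number of frames after i that are visible (latency).
    past_context:   max number of frames before i that are visible (memory);
                    None means the past is unconstrained (Figure 2 style).
    All names here are illustrative assumptions, not taken from the paper.
    """
    idx = np.arange(num_frames)
    # offset[i, j] = j - i: positive for future frames, negative for past ones.
    offset = idx[None, :] - idx[:, None]
    # Limit how far into the future each frame may look.
    mask = offset <= future_context
    if past_context is not None:
        # Additionally limit how far into the past each frame may look
        # (Figure 3 style).
        mask &= offset >= -past_context
    return mask


if __name__ == "__main__":
    # Two past frames and one future frame visible around each position.
    print(constrained_attention_mask(5, past_context=2, future_context=1).astype(int))
```

Under these assumptions, `future_context` plays the role of the latency axis in Figure 4 and `past_context` the memory axis. Libraries differ on mask polarity (whether `True` means "attend" or "block"), so the mask may need to be inverted before it is passed to a particular transformer implementation.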
