Table of Contents
Fetching ...

Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling in Driving Videos

Rongqin Liang, Yuanman Li, Jiantao Zhou, Xia Li

TL;DR

This work tackles traffic anomaly detection in driving videos under dynamic ego-vehicle motion by proposing TTHF, a text-driven, single-stage framework that aligns video clips with language prompts. It introduces Temporal High-Frequency Modeling to capture rapid scene changes and an Attentive Anomaly Focusing Mechanism that jointly leverages visually and linguistically guided attention. Through a CLIP-like cross-modal training objective with multiple contrastive losses, TTHF achieves state-of-the-art AUC on the DoTA dataset (+5.4%) and demonstrates strong generalization on the DADA-2000 dataset without fine-tuning. The approach offers a practical, language-guided paradigm for robust traffic anomaly detection in autonomous driving and ADAS contexts.

Abstract

Traffic anomaly detection (TAD) in driving videos is critical for ensuring the safety of autonomous driving and advanced driver assistance systems. Previous single-stage TAD methods primarily rely on frame prediction, making them vulnerable to interference from dynamic backgrounds induced by the rapid movement of the dashboard camera. While two-stage TAD methods appear to be a natural solution to mitigate such interference by pre-extracting background-independent features (such as bounding boxes and optical flow) using perceptual algorithms, they are susceptible to the performance of first-stage perceptual algorithms and may result in error propagation. In this paper, we introduce TTHF, a novel single-stage method aligning video clips with text prompts, offering a new perspective on traffic anomaly detection. Unlike previous approaches, the supervised signal of our method is derived from languages rather than orthogonal one-hot vectors, providing a more comprehensive representation. Further, concerning visual representation, we propose to model the high frequency of driving videos in the temporal domain. This modeling captures the dynamic changes of driving scenes, enhances the perception of driving behavior, and significantly improves the detection of traffic anomalies. In addition, to better perceive various types of traffic anomalies, we carefully design an attentive anomaly focusing mechanism that visually and linguistically guides the model to adaptively focus on the visual context of interest, thereby facilitating the detection of traffic anomalies. It is shown that our proposed TTHF achieves promising performance, outperforming state-of-the-art competitors by +5.4% AUC on the DoTA dataset and achieving high generalization on the DADA dataset.

Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling in Driving Videos

TL;DR

This work tackles traffic anomaly detection in driving videos under dynamic ego-vehicle motion by proposing TTHF, a text-driven, single-stage framework that aligns video clips with language prompts. It introduces Temporal High-Frequency Modeling to capture rapid scene changes and an Attentive Anomaly Focusing Mechanism that jointly leverages visually and linguistically guided attention. Through a CLIP-like cross-modal training objective with multiple contrastive losses, TTHF achieves state-of-the-art AUC on the DoTA dataset (+5.4%) and demonstrates strong generalization on the DADA-2000 dataset without fine-tuning. The approach offers a practical, language-guided paradigm for robust traffic anomaly detection in autonomous driving and ADAS contexts.

Abstract

Traffic anomaly detection (TAD) in driving videos is critical for ensuring the safety of autonomous driving and advanced driver assistance systems. Previous single-stage TAD methods primarily rely on frame prediction, making them vulnerable to interference from dynamic backgrounds induced by the rapid movement of the dashboard camera. While two-stage TAD methods appear to be a natural solution to mitigate such interference by pre-extracting background-independent features (such as bounding boxes and optical flow) using perceptual algorithms, they are susceptible to the performance of first-stage perceptual algorithms and may result in error propagation. In this paper, we introduce TTHF, a novel single-stage method aligning video clips with text prompts, offering a new perspective on traffic anomaly detection. Unlike previous approaches, the supervised signal of our method is derived from languages rather than orthogonal one-hot vectors, providing a more comprehensive representation. Further, concerning visual representation, we propose to model the high frequency of driving videos in the temporal domain. This modeling captures the dynamic changes of driving scenes, enhances the perception of driving behavior, and significantly improves the detection of traffic anomalies. In addition, to better perceive various types of traffic anomalies, we carefully design an attentive anomaly focusing mechanism that visually and linguistically guides the model to adaptively focus on the visual context of interest, thereby facilitating the detection of traffic anomalies. It is shown that our proposed TTHF achieves promising performance, outperforming state-of-the-art competitors by +5.4% AUC on the DoTA dataset and achieving high generalization on the DADA dataset.
Paper Structure (30 sections, 15 equations, 7 figures, 8 tables)

This paper contains 30 sections, 15 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Existing TAD approaches of single-stage paradigm (a) and two-stage paradigm (b) vs. the proposed TTHF framework (c). Existing single-stage approaches mainly rely on frame prediction, which is difficult to adapt to detecting traffic anomalies with a dynamic background, while the two-stage TAD approaches are vulnerable to the performance of the first-stage perceptual algorithms. The proposed TTHF framework is text-driven and focuses on capturing dynamic changes in driving scenes through modeling temporal high frequency to facilitate traffic anomaly detection.
  • Figure 2: Overview of our proposed TTHF. It is a CLIP-like framework for traffic anomaly detection. In this framework, we first apply a visual encoder to extract visual representations of driving video clips. Then, we propose Temporal High-Frequency Modeling (THFM) to characterize the dynamic changes of driving scenes and thus construct a more comprehensive representation of driving videos. Finally, we introduce an attentive anomaly focusing mechanism (AAFM) to enhance the perception of various types of traffic anomalies. Besides, for brevity, we denote the cross-attention as CA, the visually focused representation as VFR, and the linguistically focused representation as LFR.
  • Figure 3: An illustration of the AAFM. The original video frames are displayed in column (a). In column (b), we visualize the attention of the visual representation to the deep features of a video clip under the visually focused strategy (VFS). In column (c), we visualize the attention of the soft text representation to the deep features of a video clip under the linguistically focused strategy (LFS). We present two types of traffic anomaly scenarios. Specifically, case 1 illustrates an instance where the ego-vehicle experiences loss of control while executing a turn. In case 2, the driving vehicle observes a collision between the car turning ahead and the motorcycle traveling straight on the right.
  • Figure 4: An illustration of the high frequency. We show 3 cases as examples. The first and second columns correspond to the original consecutive video frames, and the last column is the high-frequency component extracted along the temporal dimension.
  • Figure 5: The visualization of anomaly score curves for traffic anomaly detection of different variants on the DoTA dataset. The first row of each case shows the extracted video frames of the driving video, where the red boxes mark the object involved in or causing the anomaly. The second rows show the anomaly score curves of different methods on the corresponding whole videos. For brevity, we label the TTHF-Base variant as Base and TTHF-Base with AAFM as AAFM, while Classifier denotes the classify-based TAD method. Better viewed in color.
  • ...and 2 more figures