UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark

Yu Zhang; Zhicheng Zhao; Ze Luo; Chenglong Li; Jin Tang

TL;DR

A novel Cross-spectral Traffic Cognition Network (CTCNet) is proposed, which leverages high-level semantic prototypes from an external Traffic Regulation Memory to anchor domain-specific knowledge into visual representations, enabling the model to comprehend complex behaviors and distinguish fine-grained traffic violations.

Abstract

Traffic scene understanding from unmanned aerial vehicle (UAV) platforms is crucial for intelligent transportation systems because such platforms offer flexible deployment and wide-area monitoring. However, existing methods face significant challenges in real-world surveillance, as their heavy reliance on optical imagery leads to severe performance degradation under adverse conditions such as nighttime and fog. Furthermore, current Visual Question Answering (VQA) models are restricted to elementary perception tasks, lacking the domain-specific regulatory knowledge required to assess complex traffic behaviors. To address these limitations, we propose a novel Cross-spectral Traffic Cognition Network (CTCNet) for robust UAV traffic scene understanding. Specifically, we design a Prototype-Guided Knowledge Embedding (PGKE) module that leverages high-level semantic prototypes from an external Traffic Regulation Memory (TRM) to anchor domain-specific knowledge into visual representations, enabling the model to comprehend complex behaviors and distinguish fine-grained traffic violations. Moreover, we develop a Quality-Aware Spectral Compensation (QASC) module that exploits the complementary characteristics of optical and thermal modalities to perform bidirectional context exchange, effectively compensating for degraded features to ensure robust representation in complex environments. In addition, we construct Traffic-VQA, the first large-scale optical-thermal infrared benchmark for cognitive UAV traffic understanding, comprising 8,180 aligned image pairs and 1.3 million question-answer pairs across 31 diverse types. Extensive experiments demonstrate that CTCNet significantly outperforms state-of-the-art methods in both cognition and perception scenarios. The dataset is available at https://github.com/YuZhang-2004/UAV-traffic-scene-understanding.

Paper Structure (29 sections, 9 equations, 10 figures, 6 tables)

Introduction
Related Work
Visual Question Answering
Multimodal VQA
Multimodal Large Language Models
Methodology
Overall Architecture
Traffic Regulation Memory Construction
Semantic Phrase Generation
Multi-Modal Visual Grounding
Situation Feature Aggregation
Prototype-Guided Knowledge Embedding
Quality-Aware Spectral Compensation
Loss Function
Traffic-VQA Dataset
...and 14 more sections

Figures (10)

Figure 1: Current challenges in UAV-based traffic VQA. (a) Data Gap. Existing datasets (top) rely on single-modal optical imagery for elementary perception, whereas practical surveillance (bottom) demands aligned OPT-TIR data for complex cognitive understanding. (b) Methodological Bottlenecks. General MLLMs struggle with the Domain Knowledge Gap, failing to interpret specific traffic rules (e.g., missing the "illegal" attribute of a U-turn), and the Multi-Modal Fusion Gap, where static fusion allows degraded optical noise to corrupt robust thermal features under adverse conditions. (c) Our Solution. The proposed CTCNet systematically bridges these gaps through the Traffic-VQA dataset, the PGKE module, and the QASC module.
Figure 2: Overall framework of CTCNet for multi-spectral UAV traffic VQA. The architecture adopts a Gated Parallel Residual paradigm in which the frozen, pre-trained MLLM visual features are adaptively augmented by domain-specific residual knowledge generated by the PGKE and QASC modules. The learnable gating parameters $\alpha$ and $\beta$ regulate the intensity of cognitive and multimodal context injection.
Figure 3: Multi-modal visual grounding in the TRM construction pipeline. Grounding prompts generated from semantic phrase distillation are used to localize traffic entities and behaviors (e.g., linear walkways, vehicle turning). Red boxes indicate the extracted regions of interest, demonstrating accurate text-to-region alignment in both optical and thermal imagery.
Figure 4: Internal architecture of the PGKE module. The module performs question-guided similarity retrieval to identify the top-$K$ most relevant prototypes from the TRM. These prototypes serve as keys and values in a Multi-Head Cross-Attention mechanism, injecting situational domain knowledge into the visual feature streams as an optimized residual increment $\Delta \mathbf{F}^{\mathrm{PGKE}}$.
Figure 5: Illustrative examples from the Traffic-VQA dataset. (a) Synchronized and co-registered optical and TIR UAV image pairs across diverse urban traffic settings. (b) Examples of challenging cognitive question-answer pairs that require deep situational understanding, such as identifying traffic violations and inferring latent behavioral risks.
...and 5 more figures
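
The captions of Figures 2 and 4 describe three steps: question-guided top-K prototype retrieval, cross-attention in which the retrieved prototypes act as keys and values, and a gated residual added to the frozen visual features. The following is a minimal sketch of that flow, not the authors' implementation. It assumes cosine similarity for retrieval, uses a single attention head without learned projections in place of the Multi-Head Cross-Attention, and treats the gate alpha as a plain scalar. All function names and toy vectors are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity (assumed retrieval metric); 0.0 for zero vectors."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_top_k(question, prototypes, k):
    """Indices of the k prototypes most similar to the question embedding."""
    order = sorted(range(len(prototypes)),
                   key=lambda i: cosine(question, prototypes[i]),
                   reverse=True)
    return order[:k]


def cross_attend(visual_tokens, selected):
    """Single-head scaled dot-product attention.

    Visual tokens are the queries; the retrieved prototypes serve as
    both keys and values. Returns one residual increment per token.
    """
    scale = math.sqrt(len(selected[0]))
    delta = []
    for token in visual_tokens:
        weights = softmax([dot(token, p) / scale for p in selected])
        delta.append([sum(w * p[d] for w, p in zip(weights, selected))
                      for d in range(len(token))])
    return delta


def gated_residual(visual_tokens, delta, alpha):
    """Add the increment to the frozen features, scaled by the gate alpha."""
    return [[v + alpha * d for v, d in zip(tok, inc)]
            for tok, inc in zip(visual_tokens, delta)]


# Toy data: one 2-D question embedding, four prototypes, one visual token.
question = [1.0, 0.0]
prototypes = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
visual = [[0.5, 0.5]]

top = retrieve_top_k(question, prototypes, k=2)
delta = cross_attend(visual, [prototypes[i] for i in top])
fused = gated_residual(visual, delta, alpha=0.1)
print(top, fused)
```

With alpha set to 0 the frozen features pass through unchanged, which is the usual motivation for a residual gate of this kind: the pre-trained representation is the starting point and knowledge injection is scaled in gradually.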
