LaneTCA: Enhancing Video Lane Detection with Temporal Context Aggregation

Keyi Zhou; Li Li; Wengang Zhou; Yonghui Wang; Hao Feng; Houqiang Li

TL;DR

An accumulative attention module and an adjacent attention module are developed to abstract the long-term and short-term temporal context, respectively, among successive frames in video lane detection.

Abstract

In video lane detection, successive frames carry rich temporal context, which is under-explored in existing lane detectors. In this work, we propose LaneTCA to bridge individual video frames and explore how to aggregate the temporal context effectively. Technically, we develop an accumulative attention module and an adjacent attention module to abstract the long-term and short-term temporal context, respectively. The accumulative attention module continuously accumulates visual information during the journey of a vehicle, while the adjacent attention module propagates lane information from the previous frame to the current frame. Both modules are built on the transformer architecture. Finally, these long-short context features are fused with the current frame features to predict the lane lines in the current frame. Extensive quantitative and qualitative experiments are conducted on two prevalent benchmark datasets. The results demonstrate the effectiveness of our method, which achieves several new state-of-the-art records. The code and models are available at https://github.com/Alex-1337/LaneTCA

Paper Structure (16 sections, 6 equations, 8 figures, 5 tables)

Introduction
Related Work
Image Lane Detection
Video Lane Detection
Attention Mechanism
Methodology
Architecture of LaneTCA
Initialization Setting
Training Objective
Experiments
Datasets
Evaluation Metrics
Implementation Details
Comparative Assessment
Ablation Studies
...and 1 more sections

Figures (8)

Figure 1: Comparison of existing video lane detection methods: (a) MMA-Net [zhang2021vil], requiring multiple frames of images and detection results, (b) TGC-Net, requiring multiple frames of images, (c) RVLD [jin2023recursive], requiring the current frame of image and the previous frame of detection results, and (d) our proposed LaneTCA, requiring the current frame $\bm{I}_t$, adjacent feature $\bm{f}_{AD}$, and accumulative feature $\bm{f}_{AC}$. LaneTCA requires only the visual information from the current frame while also considering information from different time spans. Concretely, $\bm{f}_{AD}$ contains information from $\bm{I}_{t-1}$, while $\bm{f}_{AC}$ contains information from frames $\bm{I}_0$ to $\bm{I}_{t-1}$.
Figure 2: An overview of the proposed LaneTCA. Given the image $\bm{I}_t$, we first apply an encoder to extract features $\bm{F}_t$. The features of the current frame are fed into the current attention module of the temporal context aggregation network. The key $\bm{k}_{t-1}$ and the value $\bm{v}_{t-1}$ from the previous frame, together with the output of the current attention module, are fed into the adjacent attention module. A learnable query $\bm{q}_l$, along with $\bm{k}_{t-1}$ and $\bm{v}_{t-1}$, is input into the accumulative attention module. The outputs of these three channels are combined to obtain the optimized current frame features. The optimized features are decoded through a series of steps to obtain the probability map $\bm{P}_{t}$ and the parameter map $\bm{C}_{t}$, which are then processed with NMS to produce the final lane lines $\bm{L}_t$. The optimized features also provide $\bm{k}_{t}$ and $\bm{v}_{t}$ for the current frame, while the output of the accumulative channel continues to serve as $\bm{q}_l$.
Figure 3: Illustration of the process of temporal context aggregation and temporal update. The aggregation phase is divided into three branches: (a) current attention, (b) adjacent attention, and (c) accumulative attention. Given the features $\bm{F}_t$ and the optimized $\bm{k}_{t-1}$ and $\bm{v}_{t-1}$ values, the branches produce individual outputs. These three attention results are then summed to obtain a more accurately optimized feature $\hat{\bm{F}}_t$. Subsequently, new $\bm{k}_t$ and $\bm{v}_t$ values are generated for the optimization of the next frame.
Figure 4: (a) Illustration of the adjacent attention module. The input $\bm{F}_t$ is transformed into $\bm{q}_t$ by a self-attention operation. An attention operation is then conducted to obtain the adjacent feature $\bm{f}_{AD}$. (b) Illustration of the accumulative attention module. $\bm{q}_l$ is set as a learnable token. $\bm{f}_{AC}$ serves as $\bm{q}_l$ for the accumulative attention block in the subsequent frame.
Figure 5: Visualization comparison with state-of-the-art methods on the VIL-100 dataset. From top to bottom, each row represents: the original image, predictions by MFIALane [qiu2022mfialane], predictions by RVLD [jin2023recursive], predictions by our method, and the ground truth.
...and 3 more figures
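
The data flow described in the captions of Figures 2 to 4 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses single-head scaled dot-product attention, replaces every learned projection with the identity, treats $\bm{q}_l$ as a single token that is broadcast-added to all positions, and initializes $\bm{k}$ and $\bm{v}$ from the first frame's features. All of these choices, and the function names `attention`, `aggregate_step`, and `run_sequence`, are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: q is (Nq, d), k and v are (Nk, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def aggregate_step(F_t, k_prev, v_prev, q_l):
    """One frame of long-short context aggregation (illustrative only).

    F_t:            (N, d) flattened encoder features of the current frame
    k_prev, v_prev: (N, d) key and value carried over from the previous frame
    q_l:            (1, d) accumulative query token (assumed to be one token)
    """
    q_t = attention(F_t, F_t, F_t)         # current branch: self-attention
    f_ad = attention(q_t, k_prev, v_prev)  # adjacent branch: short-term context
    f_ac = attention(q_l, k_prev, v_prev)  # accumulative branch: long-term context
    # The three results are summed; f_ac (1, d) broadcasts over the N positions.
    # Assumption: the actual fusion of f_ac with the feature map may differ.
    F_hat = q_t + f_ad + f_ac
    # Assumption: identity projections stand in for learned key/value layers.
    k_t, v_t = F_hat, F_hat
    return F_hat, k_t, v_t, f_ac           # f_ac serves as q_l for the next frame

def run_sequence(frames, q_init):
    # Hypothetical initialization: reuse the first frame's features as k and v.
    k_prev = v_prev = frames[0]
    q_l = q_init
    outputs = []
    for F_t in frames:
        F_hat, k_prev, v_prev, q_l = aggregate_step(F_t, k_prev, v_prev, q_l)
        outputs.append(F_hat)
    return outputs, q_l

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = [rng.standard_normal((6, 4)) for _ in range(3)]  # 3 frames, N=6, d=4
    outputs, q_l = run_sequence(frames, rng.standard_normal((1, 4)))
    print(len(outputs), outputs[0].shape, q_l.shape)
```

In this sketch the only state carried between frames is `k_prev`, `v_prev`, and `q_l`, which mirrors the point made in Figure 1 that only the current frame's image is required at each step. Because `q_l` is overwritten by `f_ac` at every frame, it depends on all earlier frames, whereas `f_ad` depends only on the key and value of the immediately preceding frame.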
