Lane Change Classification and Prediction with Action Recognition Networks

Kai Liang; Jun Wang; Abhir Bhalerao

TL;DR

This work targets the problem of predicting and classifying lane change maneuvers of surrounding vehicles using semantic visual information rather than relying solely on physical variables. It introduces two end-to-end action-recognition–based frameworks operating on RGB video (RGB+3DN) and RGB video with bounding-box augmentation (RGB+BB+3DN), evaluated across seven 3D CNN architectures, including I3D, SlowFast, and X3D, on the PREVENTION dataset. The study demonstrates state-of-the-art performance for RGB-only lane change classification (up to 84.79% top-1 with X3D-S) and near-perfect results with bounding-box augmentation (≈99% top-1), and provides CAM-based insights into the spatio-temporal regions driving predictions, along with a finding that smaller temporal kernels can better capture motion cues. The results highlight the practicality of action-recognition models for autonomous driving perception, showing significant gains in both classification and early prediction, and suggesting avenues for reducing annotation dependencies by integrating detection pipelines in future work.

Abstract

Anticipating the lane change intentions of surrounding vehicles is crucial for efficient and safe driving decision making in an autonomous driving system. Previous works often adopt physical variables such as driving speed, acceleration and so forth for lane change classification. However, physical variables do not contain semantic information. Although 3D CNNs have been developing rapidly, few methods utilise action recognition models and appearance features for lane change recognition, and they all require additional information to pre-process data. In this work, we propose an end-to-end framework including two action recognition methods for lane change recognition, using video data collected by cameras. Our method achieves the best lane change classification results using only the RGB video data of the PREVENTION dataset. Class activation maps demonstrate that action recognition models can efficiently extract lane change motions. A method to better extract motion cues is also proposed in this paper.

Paper Structure (19 sections, 2 equations, 6 figures, 1 table)

This paper contains 19 sections, 2 equations, 6 figures, 1 table.

Introduction
Related Work
Lane Change Recognition with Physical Variables
Action Recognition for Lane Change Classification
Methods
Problem Formulation
RGB+3DN: 3D Networks and RGB Video Data
RGB+BB+3DN: 3D Networks and Video Combined with Bounding Box Data
Experiments
Dataset
Evaluation Metrics
Lane Change Classification and Prediction with RGB+3DN
Lane Change Classification
Lane Change Prediction
Lane Change Classification and Prediction with RGB+BB+3DN
...and 4 more sections

Figures (6)

Figure 1: A lane change event in which a vehicle performs a right lane change. $f_0$ denotes the frame at which the lane change starts, and $f_1$ is the frame at which the rear middle part of the target vehicle is just between the lanes. The Observation Horizon is defined as the 40 frames (4 seconds at 10 FPS) before $f_0$. The Prediction Horizon, or Time To Event (TTE), is defined as the time from $f_0$ to $f_1$ and averages 20 frames at 10 FPS.
Figure 2: Architectures of the models employed, taking SlowFast-R50, X3D-S and I3D-R50 as examples. I3D and X3D take all the frames of a video clip as input. In SlowFast, the fast pathway takes α (α = 8) times as many input frames as the slow pathway and has a ratio of β (β = $1/8$) of the slow pathway's channels (underlined). The red rectangle marks the global average pooling layer of X3D-S, on which the temporal information extraction experiments were conducted.
Figure 3: Input data visualisation: (a) RGB video frames of an event; (b) video combined with bounding box data. Only the 1st, 7th, 13th, 19th, 25th and 31st frames are shown. The vehicle in the right lane in (a) performs a left lane change, and the vehicle in the left lane in (b) performs a right lane change. The frames are resized to an aspect ratio of 1 for input to a classification CNN.
Figure 4: TP and FP examples: (a) TP example: the vehicle in the middle performing a right lane change is correctly classified. (b) FP example: the vehicle in the right lane performing a left lane change is incorrectly classified as lane keeping because the target is small.
Figure 5: Class activation maps. Only the 25th to 32nd frames of the input video are shown. The model mainly focuses on the frames in which the lane change happens, as well as on the edge of the target vehicle and the lane marking it is about to cross.
...and 1 more figure
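
The frame bookkeeping described in the captions of Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the choice to exclude $f_0$ itself from the observation window is an assumption, and the evenly strided sampling for the slow pathway and the 32-frame clip length used in the usage note are also assumptions. Only the 10 FPS rate, the 40-frame horizon, and α = 8 come from the captions.

```python
FPS = 10                 # frame rate stated in the Figure 1 caption
OBSERVATION_FRAMES = 40  # 4 seconds at 10 FPS, per the Figure 1 caption
ALPHA = 8                # fast/slow frame-rate ratio, per the Figure 2 caption


def observation_window(f0: int, length: int = OBSERVATION_FRAMES) -> range:
    """Frame indices of the observation horizon: the `length` frames before f0.

    Assumption: f0 itself is excluded from the window.
    """
    start = f0 - length
    if start < 0:
        raise ValueError("not enough frames before f0 for a full window")
    return range(start, f0)


def time_to_event(f0: int, f1: int, fps: int = FPS) -> float:
    """Prediction horizon (TTE) in seconds: the time from f0 to f1."""
    if f1 < f0:
        raise ValueError("f1 must not precede f0")
    return (f1 - f0) / fps


def slowfast_frame_indices(num_frames: int, alpha: int = ALPHA):
    """Split a clip into slow and fast pathway frame indices.

    The fast pathway keeps every frame; the slow pathway keeps every
    alpha-th frame (assumed strided sampling), so the fast pathway sees
    alpha times as many frames as the slow one.
    """
    fast = list(range(num_frames))
    slow = fast[::alpha]
    return slow, fast
```

As a usage example, an event with `f0 = 100` and `f1 = 120` has an observation window covering frames 60 to 99 and a TTE of 2.0 seconds, which matches the 20-frame average quoted in Figure 1. For an assumed 32-frame clip, `slowfast_frame_indices(32)` gives the slow pathway frames 0, 8, 16 and 24 and the fast pathway all 32 frames.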
