Homography Guided Temporal Fusion for Road Line and Marking Segmentation

Shan Wang, Chuong Nguyen, Jiawei Liu, Kaihao Zhang, Wenhan Luo, Yanhao Zhang, Sundaram Muthu, Fahira Afzal Maken, Hongdong Li

TL;DR

This work tackles occlusion and lighting challenges in road line and marking segmentation for autonomous driving by combining geometric and temporal cues. It introduces HomoFusion, a homography-guided cross-frame attention module, and RSNE, a differentiable road surface normal estimator, to fuse adjacent frames and recover partially occluded markings. The approach yields state-of-the-art performance on ApolloScape and ApolloScape Night with far fewer parameters and GFLOPs, and demonstrates applicability to water puddle segmentation, highlighting its efficiency and versatility for real-time driving systems. By exploiting a ground-plane assumption and camera intrinsics, the method achieves robust cross-frame alignment and improved segmentation accuracy in challenging conditions, advancing practical deployment in edge devices.

Abstract

Reliable segmentation of road lines and markings is critical to autonomous driving. Our work is motivated by the observations that road lines and markings are (1) frequently occluded in the presence of moving vehicles, shadow, and glare and (2) highly structured with low intra-class shape variance and overall high appearance consistency. To address these issues, we propose a Homography Guided Fusion (HomoFusion) module that exploits temporally adjacent video frames for complementary cues, facilitating the correct classification of partially occluded road lines or markings. To reduce computational complexity, a novel surface normal estimator is proposed to establish spatial correspondences between the sampled frames, allowing the HomoFusion module to perform a pixel-to-pixel attention mechanism in updating the representation of the occluded road lines or markings. Experiments on ApolloScape, a large-scale lane mark segmentation dataset, and ApolloScape Night, with artificially simulated night-time road conditions, demonstrate that our method outperforms other existing SOTA lane mark segmentation models with less than 9% of their parameters and computational complexity. We show that exploiting available camera intrinsic data and a ground plane assumption for cross-frame correspondence can lead to a lightweight network with significantly improved performance in speed and accuracy. We also demonstrate the versatility of our HomoFusion approach by applying it to the problem of water puddle segmentation and achieving SOTA performance.
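The cross-frame correspondence described in the abstract rests on a standard result from multi-view geometry: for points lying on a plane, pixel coordinates in two views are related by a homography that depends only on the camera intrinsics, the relative camera pose, and the plane's normal and distance. The following is a minimal sketch of that relation, not the paper's implementation. The function names, the plane convention (n·X = d in the first camera's coordinates), the pose convention (X_b = R X_a + t), and all numeric values are assumptions chosen for illustration; in particular, the plane distance d is treated here as a known input.

```python
import numpy as np

def plane_homography(K, R, t, n, d):
    """Homography taking pixels of frame A to frame B for points on a plane.

    Assumed conventions (illustrative, not taken from the paper):
      - the plane satisfies n . X = d in frame-A camera coordinates,
      - a 3D point moves between frames as X_b = R @ X_a + t,
      - both frames share the same intrinsic matrix K.
    """
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)  # make sure the normal has unit length
    return K @ (R + np.outer(t, n) / d) @ np.linalg.inv(K)

def project(K, X):
    """Pinhole projection of a 3D point X (camera coordinates) to pixels."""
    p = K @ X
    return p[:2] / p[2]

def warp_pixel(H, uv):
    """Apply a homography to one pixel given as (u, v)."""
    p = H @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]

# Hypothetical setup: camera 1.5 m above a flat road, y axis pointing down.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
yaw = np.deg2rad(2.0)
R = np.array([[np.cos(yaw), 0.0, np.sin(yaw)],
              [0.0, 1.0, 0.0],
              [-np.sin(yaw), 0.0, np.cos(yaw)]])
t = np.array([0.0, 0.0, -1.0])
n = np.array([0.0, 1.0, 0.0])   # road surface normal in frame A
d = 1.5                         # camera-to-road distance in metres

X_a = np.array([0.5, 1.5, 10.0])   # a road point, since n . X_a equals d
X_b = R @ X_a + t                  # the same point seen from frame B

uv_a = project(K, X_a)
uv_b = warp_pixel(plane_homography(K, R, t, n, d), uv_a)
# For on-plane points, uv_b matches the direct projection project(K, X_b).
```

The relation holds only for points that actually lie on the plane, which is why an inaccurate normal shifts the computed correspondence away from the true one. Under this convention a textbook formulation with the plane written as n·X + d = 0 would show a minus sign in place of the plus.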

Paper Structure

This paper contains 27 sections, 15 equations, 17 figures, 6 tables, and 1 algorithm.

Figures (17)

  • Figure 1: Illustration of the effect of the proposed HomoFusion module, which explores the adjacent frames for cues, facilitating the correct classification of (1) a "Straight Arrow", which, with its bottom half occluded by a vehicle, is mistakenly classified as a "Right Turn & Straight Arrow" without the HomoFusion module, and (2) a partially occluded "Dotted Line", which is incorrectly classified as a "Solid Line" without the HomoFusion module. The fused frame in the $4^{th}$ row and the $2^{nd}$ column demonstrates the recovered road lines and markings after projecting the previous frames onto the current frame with the estimated homography matrices. The yellow box enlarges the area where misclassifications are corrected. The red box indicates the spatially corresponding area across the frames. Best viewed in color.
  • Figure 2: Overview of our proposed model, consisting of a lightweight encoder-decoder pair, our proposed HomoFusion module, and our proposed Road Surface Normal Estimator (RSNE). A sequence of frames $\mathbf{I}$, including a target frame $\mathbf{I_{t}}$ and $n - 1$ previous frames, is encoded into the feature representations ($\mathbf{F}^{l}$). RSNE estimates the road surface normal vector, which, combined with the camera intrinsic and extrinsic parameters, yields a homography matrix between each frame pair, establishing cross-frame spatial correspondences. HomoFusion uses a pixel-to-pixel attention mechanism, guided by the spatial correspondences across frames, to obtain a temporally consistent representation for on-road pixels of the current frame. Finally, the decoder decodes and upsamples the temporally consistent feature representations to produce the lane mark segmentation prediction ($\mathbf{S_{t}^{prd}}$).
  • Figure 3: Illustration of sample points. (Right) Sample points in the current frame. (Left/Middle) Correspondence of a sample point in previous frames. The red point coordinate is calculated by using the correct normal, while the cyan point coordinate is calculated by using the initial (incorrect) normal.
  • Figure 4: Sample images from the ApolloScape Night dataset. Top: original daytime images from the ApolloScape dataset. Bottom: synthesized night-time images.
  • Figure 5: Qualitative comparison with SOTA methods. The top two examples are from the ApolloScape dataset (Huang et al., 2018), and the bottom two examples are from the ApolloScape Night dataset. Yellow boxes highlight the area of interest for better visualization. Red boxes indicate false-positive segmentation predictions. Best viewed in color.
  • ...and 12 more figures
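
The pixel-to-pixel attention mentioned in the Figure 2 caption can be pictured as follows: once homographies give each on-road pixel of the current frame a corresponding location in every other frame, attention only needs to compare that pixel with its own correspondences, rather than with every pixel of every frame. The sketch below is a hedged illustration of that idea using plain scaled dot-product attention. The function name, tensor layout, and the absence of learned projections are assumptions made for brevity, and the actual HomoFusion module may differ.

```python
import numpy as np

def pixel_to_pixel_attention(q, k, v):
    """Fuse per-pixel features across frames.

    q: (P, C)    queries, one per sampled pixel of the current frame
    k: (P, T, C) keys gathered at the corresponding location in each of T frames
    v: (P, T, C) values gathered at the same locations
    Returns the fused features (P, C) and the attention weights (P, T).
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    logits = np.einsum("pc,ptc->pt", q, k) * scale
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over frames
    fused = np.einsum("pt,ptc->pc", weights, v)
    return fused, weights

# Hypothetical sizes: 4 sampled pixels, 3 frames, 8 feature channels.
rng = np.random.default_rng(0)
P, T, C = 4, 3, 8
q = rng.normal(size=(P, C))
k = rng.normal(size=(P, T, C))
v = rng.normal(size=(P, T, C))
fused, weights = pixel_to_pixel_attention(q, k, v)
```

In this layout each pixel attends over only T candidates, so the cost grows with P × T rather than with the square of the number of pixels, which is consistent with the low computational footprint the abstract reports.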