TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding

Muhammet Esat Kalfaoglu, Halil Ibrahim Ozturk, Ozsel Kilinc, Alptekin Temizel

TL;DR

TopoMaskV3 advances the mask-based road topology pipeline into a robust, standalone 3D predictor through two novel dense prediction heads: a dense offset field that corrects sub-grid discretization error within the existing BEV resolution, and a dense height map for direct 3D estimation.

Abstract

Mask-based paradigms for road topology understanding, such as TopoMaskV2, offer a complementary alternative to query-based methods by generating centerlines via a dense rasterized intermediate representation. However, prior work was limited to 2D predictions and suffered from severe discretization artifacts, necessitating fusion with parametric heads. We introduce TopoMaskV3, which advances this pipeline into a robust, standalone 3D predictor via two novel dense prediction heads: a dense offset field for sub-grid discretization correction within the existing BEV resolution, and a dense height map for direct 3D estimation. Beyond the architecture, we are the first to address geographic data leakage in road topology evaluation by introducing (1) geographically distinct splits to prevent memorization and ensure fair generalization, and (2) a long-range (+/-100 m) benchmark. TopoMaskV3 achieves state-of-the-art 28.5 OLS on this geographically disjoint benchmark, surpassing all prior methods. Our analysis shows that the mask representation is more robust to geographic overfitting than Bezier, while LiDAR fusion is most beneficial at long range and exhibits larger relative gains on the overlapping original split, suggesting overlap-induced memorization effects.
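
As a rough illustration of how the two dense heads could work together, the sketch below lifts foreground mask cells to 3D points: each BEV cell center is shifted by its predicted offset and paired with its predicted height. The function name `lift_mask_to_3d`, the grid origin (`x_min`, `y_min`), the `cell_size`, the 0.5 threshold, and the convention that offsets are expressed in meters are all assumptions made for this example, not details taken from the paper.

```python
def lift_mask_to_3d(mask, offsets, heights, cell_size, x_min, y_min, threshold=0.5):
    """Convert dense BEV predictions into a list of (x, y, z) points.

    mask:    2D list of foreground probabilities, indexed [row][col]
    offsets: 2D list of (dx, dy) sub-grid corrections in meters (assumed unit)
    heights: 2D list of predicted heights in meters
    """
    points = []
    for row, mask_row in enumerate(mask):
        for col, prob in enumerate(mask_row):
            if prob < threshold:
                continue  # background cell, no centerline here
            dx, dy = offsets[row][col]
            # Cell center in metric BEV coordinates, then sub-grid correction.
            x = x_min + (col + 0.5) * cell_size + dx
            y = y_min + (row + 0.5) * cell_size + dy
            # The height map supplies the third coordinate directly.
            z = heights[row][col]
            points.append((x, y, z))
    return points


# Hypothetical 2x2 grid with 0.5 m cells covering [-1, 0] m on both axes.
mask = [[0.9, 0.1], [0.2, 0.8]]
offsets = [[(0.1, -0.2), (0.0, 0.0)], [(0.0, 0.0), (-0.05, 0.15)]]
heights = [[1.5, 0.0], [0.0, 2.0]]
points_3d = lift_mask_to_3d(mask, offsets, heights, cell_size=0.5, x_min=-1.0, y_min=-1.0)
```

Without the offset term, every point would snap to a cell center, so localization error would be bounded only by the grid resolution; the offset lets the prediction move within a cell without increasing the BEV resolution.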

Paper Structure (17 sections, 8 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Quad-Direction Labels Encoding. Each centerline is assigned one of four directional labels: up, down, left, or right, based on majority voting between consecutive points. Ties are resolved using the angle between the start and end points.
  • Figure 2: TopoMaskV3 Architecture Overview. The method adopts an instance-query-based design. Bird's Eye View (BEV) features extracted from multi-camera images are processed by a transformer decoder that predicts: binary masks, quad-direction labels, 2D offsets, and height maps. A quad-direction-aware post-processing step then converts these dense outputs into 3D centerline instances.
  • Figure 3: TopoMaskV3 Decoder Architecture. Each sparse query is decoded by five parallel heads, each predicting a different centerline attribute.
  • Figure 4: Offset Refinement Scheme. (a) A continuous straight centerline (blue) and its rasterized representation. (b) Centerpoints obtained using conventional row/column-wise extraction. (c) The multi-point proposal predicts an offset vector from each raster pixel to its closest point on the continuous centerline, enabling one-to-many matching. (d) The single-point proposal refines the centerpoints by predicting offsets toward their nearest centerline point, enforcing one-to-one matching and refining centerline localization.
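
The quad-direction labeling described in the Figure 1 caption can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes each pair of consecutive points votes for the direction of its dominant axis, that "up" means increasing y and "right" means increasing x, and that a tie is broken by placing the start-to-end angle into one of four 90-degree sectors. The names `segment_direction` and `quad_direction_label` are hypothetical.

```python
import math
from collections import Counter


def segment_direction(dx, dy):
    """Vote of one non-degenerate segment: the dominant axis decides (assumed rule)."""
    if abs(dx) >= abs(dy):
        return "right" if dx > 0 else "left"
    return "up" if dy > 0 else "down"


def quad_direction_label(points):
    """Assign up/down/left/right to a centerline given as a list of (x, y) points."""
    votes = Counter()
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx == 0 and dy == 0:
            continue  # repeated point carries no direction
        votes[segment_direction(dx, dy)] += 1
    if not votes:
        raise ValueError("need at least two distinct points")

    ranked = votes.most_common()
    if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
        return ranked[0][0]  # clear majority

    # Tie: fall back to the angle of the start-to-end vector (assumed sectors).
    (xs, ys), (xe, ye) = points[0], points[-1]
    angle = math.degrees(math.atan2(ye - ys, xe - xs))
    if -45 <= angle < 45:
        return "right"
    if 45 <= angle < 135:
        return "up"
    if -135 <= angle < -45:
        return "down"
    return "left"


# Two "right" segments outvote one "up" segment.
majority_label = quad_direction_label([(0, 0), (1, 0.1), (2, 0.3), (2.2, 1.5)])
# One "right" and one "up" segment tie; the start-to-end angle (about 63 degrees) decides.
tie_label = quad_direction_label([(0, 0), (1, 0), (1, 2)])
```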