Terrain-Enhanced Resolution-aware Refinement Attention for Off-Road Segmentation

Seongkyu Choi, Jhonghyun An

TL;DR

The paper addresses off-road semantic segmentation under imperfect supervision by introducing a resolution-aware token decoder that concentrates computation at a low-resolution bottleneck while injecting a single high-resolution cue. It combines Global-Local Token Refinement (GLTR), gated high-resolution cross-attention, and Class-Aware Point Refinement (CAPR) to balance global context, local detail, and boundary fidelity, complemented by a training-time Boundary-Band Consistency Loss (BBL). Key contributions include the three-component resolution-aware decoder (TERRA), CAPR for uncertainty-driven sparse refinement, and BBL to regularize boundary interactions without affecting inference cost. Empirical results on RUGD and RELLIS-3D show competitive or superior performance, with notably cleaner boundaries and improved stability under label noise, suggesting practical benefits for autonomous off-road perception and other domains with noisy annotations.

Abstract

Off-road semantic segmentation suffers from thick, inconsistent boundaries, sparse supervision for rare classes, and pervasive label noise. Designs that fuse only at low resolution blur edges and propagate local errors, whereas maintaining high-resolution pathways or repeating high-resolution fusions is costly and fragile to noise. We introduce a resolution-aware token decoder that balances global semantics, local consistency, and boundary fidelity under imperfect supervision. Most computation occurs at a low-resolution bottleneck; a gated cross-attention injects fine-scale detail, and only a sparse, uncertainty-selected set of pixels is refined. The components are co-designed and tightly integrated: global self-attention with lightweight dilated depthwise refinement restores local coherence; a gated cross-attention integrates fine-scale features from a standard high-resolution encoder stream without amplifying noise; and a class-aware point refinement corrects residual ambiguities with negligible overhead. During training, we add a boundary-band consistency regularizer that encourages coherent predictions in a thin neighborhood around annotated edges, with no inference-time cost. Overall, the results indicate competitive performance and improved stability across transitions.
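
To make "a sparse, uncertainty-selected set of pixels" concrete, the following is a minimal sketch of one way such a selection could work. It is not the paper's CAPR implementation: the abstract does not state how uncertainty is measured, so the top-two probability margin used here, the function name `select_uncertain_pixels`, and the budget `k` are all illustrative assumptions.

```python
from typing import List, Tuple


def select_uncertain_pixels(
    probs: List[List[List[float]]], k: int
) -> List[Tuple[int, int]]:
    """Return (row, col) coordinates of the k least-confident pixels.

    probs[r][c] holds the per-class probabilities of pixel (r, c) and must
    contain at least two classes.
    ASSUMPTION: uncertainty = 1 - (top1 - top2). This margin rule is chosen
    for illustration; the source does not specify the criterion.
    """
    scored = []
    for r, row in enumerate(probs):
        for c, p in enumerate(row):
            top1, top2 = sorted(p, reverse=True)[:2]
            scored.append((1.0 - (top1 - top2), r, c))
    # Most uncertain first; ties are broken in raster order for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in scored[:k]]


# Hypothetical 2x2 prediction over 3 classes.
probs = [
    [[0.90, 0.05, 0.05], [0.50, 0.45, 0.05]],
    [[0.34, 0.33, 0.33], [0.70, 0.20, 0.10]],
]
picked = select_uncertain_pixels(probs, k=2)
```

In this toy input the two pixels with the smallest top-two margin are chosen, and only those would be passed to a refinement step, which is why the overhead of such a stage can stay small relative to dense high-resolution processing.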

Paper Structure

This paper contains 16 sections, 8 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Off-road scenes pose four recurring challenges: (i) diffuse, platform-dependent transitions; (ii) frequent thin structures; (iii) low contrast under shadows, glare, and dust; (iv) vegetation self-occlusion and fine-scale variation.
  • Figure 2: Overall framework. Multi-scale features are fused at a bottleneck and refined by GLTR. A single high-resolution (HR) cross-attention injects sharp cues, and a three-way gate blends $\{T_0, C, B\}$. CAPR revisits only uncertainty-selected pixels. During training, diagonal supervision and a thin boundary-band loss (BBL) regularize attention near edges.
  • Figure 3: Qualitative comparison on RUGD with the baseline (left) and TERRA (right). Red boxes highlight regions where the baseline either misses classes or mixes them. In contrast, TERRA captures thin structures and fine details more precisely, reduces interior holes and clutter, and traces boundaries more sharply and continuously—resulting in segmentations that better reflect the actual scene layout.
  • Figure 4: Qualitative comparison on RUGD with noisy ground-truth (GT). GT labels misalign with actual scene structure, causing GA-Nav to inherit labeling errors and produce fragmented regions. In contrast, TERRA learns more robust representations, yielding cleaner borders and predictions that better align with the true layout despite annotation noise.
  • Figure 5: Qualitative comparison on RELLIS-3D. Columns: RGB, GT, GA-Nav, and our method. Compared with GA-Nav, our method suppresses small holes in wide traversable areas, reduces vegetation clutter, and yields sharper, more continuous boundaries despite annotation noise.
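
The "thin boundary band" mentioned in the Figure 2 caption and the abstract can be pictured as a mask of pixels lying close to a change in the annotated label. The sketch below shows one plausible construction and is not taken from the paper: the function name `boundary_band`, the 4-neighbor edge test, the square (Chebyshev) dilation, and the `radius` parameter are assumptions made for illustration. It only builds the mask; the consistency loss that would be evaluated inside it is not reproduced here.

```python
from typing import List


def boundary_band(labels: List[List[int]], radius: int = 1) -> List[List[bool]]:
    """Mark pixels within `radius` (Chebyshev distance) of a label change.

    ASSUMPTION: an edge pixel is any pixel with a differently labeled
    4-neighbor, and the band is the dilation of those edge pixels by a
    square window. The source does not specify the band construction.
    """
    h, w = len(labels), len(labels[0])

    # Step 1: find edge pixels by comparing each pixel with its 4-neighbors.
    edge = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and labels[rr][cc] != labels[r][c]:
                    edge[r][c] = True
                    break

    # Step 2: dilate the edge mask by a (2 * radius + 1) square window.
    band = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if edge[r][c]:
                for rr in range(max(0, r - radius), min(h, r + radius + 1)):
                    for cc in range(max(0, c - radius), min(w, c + radius + 1)):
                        band[rr][cc] = True
    return band


# Hypothetical 2x6 label map with one vertical class boundary.
labels = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
band0 = boundary_band(labels, radius=0)
band1 = boundary_band(labels, radius=1)
```

With `radius=0` the mask covers only the two columns adjacent to the boundary; with `radius=1` it widens by one pixel on each side. Because such a mask would only be needed when computing a training loss, it is consistent with the statement that the regularizer adds no inference-time cost.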