Terrain-Enhanced Resolution-aware Refinement Attention for Off-Road Segmentation
Seongkyu Choi, Jhonghyun An
TL;DR
The paper addresses off-road semantic segmentation under imperfect supervision by introducing a resolution-aware token decoder that concentrates computation at a low-resolution bottleneck while injecting a single high-resolution cue. It combines Global-Local Token Refinement (GLTR), gated high-resolution cross-attention, and Class-Aware Point Refinement (CAPR) to balance global context, local detail, and boundary fidelity, complemented by a training-time Boundary-Band Consistency Loss (BBL). Key contributions include the three-component resolution-aware decoder (TERRA), CAPR for uncertainty-driven sparse refinement, and BBL to regularize boundary interactions without affecting inference cost. Empirical results on RUGD and RELLIS-3D show competitive or superior performance, with notably cleaner boundaries and improved stability under label noise, suggesting practical benefits for autonomous off-road perception and other domains with noisy annotations.
Abstract
Off-road semantic segmentation suffers from thick, inconsistent boundaries, sparse supervision for rare classes, and pervasive label noise. Designs that fuse only at low resolution blur edges and propagate local errors, whereas maintaining high-resolution pathways or repeating high-resolution fusions is costly and fragile to noise. We introduce a resolution-aware token decoder that balances global semantics, local consistency, and boundary fidelity under imperfect supervision. Most computation occurs at a low-resolution bottleneck; a gated cross-attention injects fine-scale detail, and only a sparse, uncertainty-selected set of pixels is refined. The components are co-designed and tightly integrated: global self-attention with lightweight dilated depthwise refinement restores local coherence; a gated cross-attention integrates fine-scale features from a standard high-resolution encoder stream without amplifying noise; and a class-aware point refinement corrects residual ambiguities with negligible overhead. During training, we add a boundary-band consistency regularizer that encourages coherent predictions in a thin neighborhood around annotated edges, with no inference-time cost. Overall, the results indicate competitive performance and improved stability across transitions.
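To make the idea of refining only a sparse, uncertainty-selected set of pixels more concrete, the following is a minimal sketch in plain Python. It is not the paper's CAPR implementation: the uncertainty score (one minus the gap between the two largest softmax probabilities), the function names, and the toy logits are all assumptions chosen for illustration, since the abstract does not specify the selection criterion.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def margin_uncertainty(logits):
    """Assumed score: 1 - (top-1 prob - top-2 prob). Higher means more ambiguous."""
    probs = sorted(softmax(logits), reverse=True)
    return 1.0 - (probs[0] - probs[1])


def select_uncertain_pixels(logit_map, k):
    """Return the (row, col) coordinates of the k most uncertain pixels.

    logit_map is a nested list indexed as [row][col][class].
    Ties are broken by row, then column, to keep the result deterministic.
    """
    scored = []
    for r, row in enumerate(logit_map):
        for c, logits in enumerate(row):
            scored.append((margin_uncertainty(logits), r, c))
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in scored[:k]]


# Hypothetical 2x2 prediction with 3 classes per pixel.
logit_map = [
    [[5.0, 0.0, 0.0], [1.0, 0.9, 0.0]],  # confident pixel, ambiguous pixel
    [[0.0, 4.0, 0.0], [0.2, 0.0, 0.2]],  # confident pixel, exact two-way tie
]

# Only these pixels would be passed to a (not shown) refinement head.
picked = select_uncertain_pixels(logit_map, k=2)
print(picked)
```

In this toy case the two confident pixels are skipped and only the two ambiguous ones are selected, which is why the cost of such a refinement step can stay small relative to dense high-resolution processing.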
