A Cross-Scale Decoder with Token Refinement for Off-Road Semantic Segmentation

Seongkyu Choi Jhonghyun An

Abstract

Off-road semantic segmentation is fundamentally challenged by irregular terrain, vegetation clutter, and inherent annotation ambiguity. Unlike urban scenes with crisp object boundaries, off-road environments exhibit strong class-level similarity among terrain categories, resulting in thick and uncertain transition regions that degrade boundary coherence and destabilize training. Rare or thin structures, such as narrow traversable gaps or isolated obstacles, further receive sparse and unreliable supervision and are easily overwhelmed by dominant background textures. Existing decoder designs either rely on low-scale bottlenecks that oversmooth fine structural details, or repeatedly fuse high-detail features, which tends to amplify annotation noise and incur substantial computational cost. We present a cross-scale decoder that explicitly addresses these challenges through three complementary mechanisms. First, a global--local token refinement module consolidates semantic context on a compact bottleneck lattice, guided by boundary-aware regularization to remain robust under ambiguous supervision. Second, a gated detail bridge selectively injects fine-scale structural cues only once through cross-scale attention, preserving boundary and texture information while avoiding noise accumulation. Third, an uncertainty-guided class-aware point refinement selectively updates the least reliable pixels, improving rare and ambiguous structures with minimal computational overhead. The resulting framework achieves noise-robust and boundary-preserving segmentation tailored to off-road environments, recovering fine structural details while maintaining deployment-friendly efficiency. Experimental results on standard off-road benchmarks demonstrate consistent improvements over prior approaches without resorting to heavy dense feature fusion.
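The abstract's third mechanism selects the least reliable pixels and refines only those. As a rough illustration of how such uncertainty-guided point refinement typically works, the sketch below uses a margin-based uncertainty score (gap between the top-2 class logits) and re-predicts only the k least-confident pixels, in the spirit of PointRend-style decoders. The function name, the margin heuristic, and the `refine_fn` interface are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def uncertainty_point_refine(logits, refine_fn, k=64):
    """Refine the k least-confident pixels of a dense logit map.

    logits:    (C, H, W) array of per-class scores.
    refine_fn: maps (N, C) logit vectors to refined (N, C) logits
               (e.g., a small point-wise head; here an assumption).
    Uncertainty is the margin between the top-2 class scores per
    pixel; a small margin marks an ambiguous pixel.
    """
    C, H, W = logits.shape
    flat = logits.reshape(C, -1)             # (C, H*W)
    top2 = np.sort(flat, axis=0)[-2:, :]     # two largest scores per pixel
    margin = top2[1] - top2[0]               # small margin => uncertain
    idx = np.argsort(margin)[:k]             # k least-confident pixels
    refined = refine_fn(flat[:, idx].T)      # refine only those points
    out = flat.copy()
    out[:, idx] = refined.T
    return out.reshape(C, H, W)
```

Because only k points are re-predicted per image, the overhead stays nearly constant regardless of resolution, which matches the paper's stated goal of improving rare and ambiguous structures at minimal cost.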

Paper Structure

This paper contains 17 sections, 9 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Off-road scenes exhibit diverse challenges, including irregular and uneven terrain, ambiguous boundaries between similar surfaces, strong light reflections, and severe occlusion by vegetation.
  • Figure 2: Overall framework of our CSTR. Multi-scale backbone features are compressed into a compact bottleneck token $T_0$ and refined by Global--Local Token Refinement (GLTR). Boundary-Guided Correction extracts fine-scale structural cues, which are selectively consulted via a Gated Cross-Scale Interaction (GCS) to recover thin or ambiguous structures efficiently.
  • Figure 3: Structure of the Boundary-Guided Correction module. Boundary-sensitive cues are extracted through an edge path, while smoothed contextual cues are obtained via a grid path. These fine-scale structural cues are later consulted through gated cross-scale interaction without dense fusion.
  • Figure 4: Qualitative comparison on RUGD highlighting rare and thin structures. Compared to GA-Nav, CSTR better preserves small objects and narrow regions while maintaining coherent terrain boundaries.
  • Figure 5: Qualitative comparison on RELLIS-3D under visually ambiguous terrain transitions. CSTR produces more coherent region interiors and smoother class boundaries around obstacles and vegetation compared to GA-Nav.
  • ...and 1 more figure