EAGLE: Efficient Adaptive Geometry-based Learning in Cross-view Understanding

Thanh-Dat Truong, Utsav Prabhu, Dongyi Wang, Bhiksha Raj, Susan Gauch, Jeyamkondan Subbiah, Khoa Luu

TL;DR

This work addresses semantic segmentation across camera viewpoints by introducing EAGLE, a cross-view adaptation framework that explicitly models geometric changes between views. It combines three components: a cross-view geometric constraint learned on unpaired data, a geodesic-flow-based metric that quantifies structural changes along Grassmann manifolds, and a view-conditioned prompting mechanism that injects view-specific context into open-vocabulary segmentation. The approach reports state-of-the-art results on cross-view benchmarks that adapt from SYNTHIA, GTA, and BDD to UAVID, and improves performance on unseen classes in open-vocabulary segmentation. Because it does not require paired-view data, EAGLE reduces the annotation burden and improves deployment reliability in multi-view scenarios.

Abstract

Unsupervised Domain Adaptation has been an efficient approach to transferring semantic segmentation models across data distributions. Meanwhile, recent Open-vocabulary Semantic Scene Understanding based on large-scale vision-language models is effective in open-set settings because it can learn diverse concepts and categories. However, these prior methods fail to generalize across different camera views because they lack cross-view geometric modeling, and only a limited number of studies have analyzed cross-view learning. To address this problem, we introduce a novel Unsupervised Cross-view Adaptation Learning approach that models the geometric structural change across views in Semantic Scene Understanding. First, we introduce a novel Cross-view Geometric Constraint on Unpaired Data to model structural changes in images and segmentation masks across cameras. Second, we present a new Geodesic Flow-based Correlation Metric to efficiently measure the geometric structural changes across camera views. Third, we introduce a novel view-condition prompting mechanism to enhance the view-information modeling of the open-vocabulary segmentation network in cross-view adaptation learning. Experiments on different cross-view adaptation benchmarks show the effectiveness of our approach in cross-view modeling and demonstrate that it achieves State-of-the-Art (SOTA) performance compared to prior unsupervised domain adaptation and open-vocabulary semantic segmentation methods.
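
To make the idea of measuring structural change on a Grassmann manifold more concrete, the following is a minimal sketch, not the paper's Geodesic Flow-based Correlation Metric. It assumes each camera view is summarized by a low-dimensional subspace of its feature vectors and computes the standard geodesic distance between two such subspaces from their principal angles. The function names `orthonormal_basis` and `grassmann_geodesic_distance`, the use of NumPy, and the random demo features are all assumptions made for illustration.

```python
import numpy as np


def orthonormal_basis(features, dim):
    """Return a (D, dim) orthonormal basis of the top-`dim` principal
    directions of `features`, an (N, D) array with one sample per row.

    Hypothetical helper: how the paper builds its subspaces is not
    stated in the abstract.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal right singular vectors, ordered by
    # decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:dim].T


def grassmann_geodesic_distance(basis_a, basis_b):
    """Geodesic distance between the subspaces spanned by two
    orthonormal bases of shape (D, dim).

    The singular values of basis_a^T basis_b are the cosines of the
    principal angles; the distance is the 2-norm of those angles.
    """
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    angles = np.arccos(np.clip(cosines, -1.0, 1.0))
    return float(np.linalg.norm(angles))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for features extracted from two camera views (assumed data).
    car_view = rng.normal(size=(200, 16))
    drone_view = rng.normal(size=(200, 16))
    a = orthonormal_basis(car_view, dim=4)
    b = orthonormal_basis(drone_view, dim=4)
    print("geodesic distance:", grassmann_geodesic_distance(a, b))
```

A distance of zero means the two subspaces coincide, and the maximum for `dim` dimensions is reached when all principal angles equal π/2, that is, when the subspaces are mutually orthogonal.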

Paper Structure (18 sections, 20 equations, 8 figures, 8 tables)

Figures (8)

  • Figure 1: Our Proposed Cross-view Adaptation Learning Approach. Prior models trained on the car view, e.g., FreeSeg [qin2023freeseg] and DenseCLIP [rao2021denseclip], perform poorly on drone-view images. In contrast, our cross-view adaptation approach generalizes well from the car view to the drone view.
  • Figure 2: Illustration of Cross-View Adaptation.
  • Figure 3: Our Cross-View Learning Framework.
  • Figure 4: The Qualitative Results of Cross-View Adaptation (Without Prompt).
  • Figure 5: The Qualitative Results of Cross-View Adaptation (With Prompt).
  • ...and 3 more figures