Discrete Latent Perspective Learning for Segmentation and Detection
Deyi Ji, Feng Zhao, Lanyun Zhu, Wenwei Jin, Hongtao Lu, Jieping Ye
TL;DR
DLPL addresses perspective variability by learning a discrete latent representation of viewpoint, which enables latent multi-perspective fusion from single-view images. It combines Perspective Discrete Decomposition (PDD), Perspective Homography Transformation (PHT), and Perspective Invariant Attention (PIA) to reconstruct and align multi-perspective features within a Vision Transformer backbone, trained with a reconstruction loss and an EMA-updated embedding space. The approach yields consistent improvements in segmentation and detection on UAV, auto-driving, and daily-photo datasets, with modest parameter overhead and strong cross-task generalization. The framework is not tied to a particular backbone or dataset, so it can be integrated into existing models to improve robustness to viewpoint changes in real-world imagery.
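To make the "discrete latent representation" and "EMA-based embedding space" concrete, here is a minimal sketch, assuming PDD behaves like standard vector quantization: each feature vector is snapped to its nearest codebook entry, and the codebook is nudged toward the features assigned to it by an exponential moving average. The function names, the `decay` parameter, and the simplified update rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of feature discretization with an EMA-updated codebook.
# All names and the simplified update rule are assumptions for illustration.

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sqdist(vec, codebook[k]))


def quantize(features, codebook):
    """Map every feature vector to its nearest code; return indices and codes."""
    indices = [nearest_code(f, codebook) for f in features]
    quantized = [list(codebook[k]) for k in indices]
    return indices, quantized


def reconstruction_loss(features, quantized):
    """Mean squared error between the original and the quantized features."""
    total = sum((x - y) ** 2
                for f, q in zip(features, quantized)
                for x, y in zip(f, q))
    count = sum(len(f) for f in features)
    return total / count


def ema_update(codebook, features, indices, decay=0.99):
    """Move each used code toward the mean of the features assigned to it.

    Codes with no assigned features are left unchanged.
    """
    new_codebook = [list(c) for c in codebook]
    for k in range(len(codebook)):
        assigned = [f for f, i in zip(features, indices) if i == k]
        if not assigned:
            continue
        mean = [sum(col) / len(assigned) for col in zip(*assigned)]
        new_codebook[k] = [decay * c + (1 - decay) * m
                           for c, m in zip(codebook[k], mean)]
    return new_codebook


features = [[0.0, 0.1], [1.0, 0.9], [0.2, 0.0]]
codebook = [[0.0, 0.0], [1.0, 1.0]]
indices, quantized = quantize(features, codebook)
loss = reconstruction_loss(features, quantized)
updated = ema_update(codebook, features, indices, decay=0.5)
```

In this toy run the first and third features map to code 0 and the second to code 1. A practical EMA codebook usually also tracks smoothed cluster sizes; that bookkeeping is omitted here for brevity.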
Abstract
In this paper, we address the challenge of Perspective-Invariant Learning in machine learning and computer vision: enabling a network to interpret images taken from varying perspectives with consistent semantics. While standard approaches rely on the labor-intensive collection of multi-view images or on limited data augmentation techniques, we propose a novel framework, Discrete Latent Perspective Learning (DLPL), for latent multi-perspective fusion learning using conventional single-view images. DLPL comprises three main modules: Perspective Discrete Decomposition (PDD), Perspective Homography Transformation (PHT), and Perspective Invariant Attention (PIA), which respectively discretize visual features, transform perspectives, and fuse multi-perspective semantic information. DLPL is a universal perspective learning framework applicable to a variety of scenarios and vision tasks. Extensive experiments demonstrate that DLPL significantly enhances the network's capacity to represent images across diverse scenarios (daily photos, UAV, auto-driving) and tasks (detection, segmentation).
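As background for the perspective-transformation step, a homography is a 3x3 matrix that maps points between two views of a plane using homogeneous coordinates. The sketch below shows only this generic operation on 2D points; how PHT obtains or parameterizes its matrices is not stated here, so the function name and the example matrices are assumptions chosen for illustration.

```python
# Generic planar homography applied to 2D points (illustrative sketch only;
# the matrices below are made-up examples, not values from the paper).

def apply_homography(H, points):
    """Map (x, y) points through a 3x3 homography H given as nested lists."""
    mapped = []
    for x, y in points:
        # Lift to homogeneous coordinates (x, y, 1) and multiply by H.
        xh = H[0][0] * x + H[0][1] * y + H[0][2]
        yh = H[1][0] * x + H[1][1] * y + H[1][2]
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        if abs(w) < 1e-12:
            raise ValueError("point maps to infinity under this homography")
        # Divide by w to return to Cartesian coordinates.
        mapped.append((xh / w, yh / w))
    return mapped


# A pure translation by (2, 3): the last row stays (0, 0, 1).
translation = [[1.0, 0.0, 2.0],
               [0.0, 1.0, 3.0],
               [0.0, 0.0, 1.0]]

# A perspective change: the non-zero entry in the last row makes the
# scaling depend on x, so points farther right are compressed more.
perspective = [[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0],
               [0.5, 0.0, 1.0]]

shifted = apply_homography(translation, [(1.0, 1.0)])
warped = apply_homography(perspective, [(2.0, 2.0), (0.0, 4.0)])
```

The division by `w` is what distinguishes a perspective transform from an affine one: when the last row is (0, 0, 1), `w` is always 1 and the mapping reduces to an affine transform.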
