Discrete Latent Perspective Learning for Segmentation and Detection

Deyi Ji, Feng Zhao, Lanyun Zhu, Wenwei Jin, Hongtao Lu, Jieping Ye

TL;DR

DLPL addresses perspective variability by learning a discrete latent representation of view to enable latent multi-perspective fusion from single-view images. It combines Perspective Discrete Decomposition (PDD), Perspective Homography Transformation (PHT), and Perspective Invariant Attention (PIA) to reconstruct and align multi-perspective features within a Vision Transformer backbone, guided by a reconstruction loss and EMA-based embedding space. The approach yields consistent improvements in segmentation and detection across UAV, auto-driving, and daily photo datasets, with modest parameter overhead and strong cross-task generalization. This work offers a universal framework for perspective-invariant learning that can be integrated with various backbones and datasets, broadening robustness to viewpoint changes in real-world imagery.

Abstract

In this paper, we address the challenge of Perspective-Invariant Learning in machine learning and computer vision, which involves enabling a network to understand images from varying perspectives to achieve consistent semantic interpretation. While standard approaches rely on the labor-intensive collection of multi-view images or limited data augmentation techniques, we propose a novel framework, Discrete Latent Perspective Learning (DLPL), for latent multi-perspective fusion learning using conventional single-view images. DLPL comprises three main modules: Perspective Discrete Decomposition (PDD), Perspective Homography Transformation (PHT), and Perspective Invariant Attention (PIA), which work together to discretize visual features, transform perspectives, and fuse multi-perspective semantic information, respectively. DLPL is a universal perspective learning framework applicable to a variety of scenarios and vision tasks. Extensive experiments demonstrate that DLPL significantly enhances the network's capacity to depict images across diverse scenarios (daily photos, UAV, auto-driving) and tasks (detection, segmentation).
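As a rough illustration of the kind of operation a homography-based perspective transformation relies on, the sketch below maps 2D points through a 3x3 homography using homogeneous coordinates. It is not the paper's PHT module: the function name `apply_homography` and the matrix `H` are hypothetical and chosen only for illustration.

```python
import numpy as np

def apply_homography(H, points):
    """Map (N, 2) points through a 3x3 homography H via homogeneous coordinates."""
    pts = np.asarray(points, dtype=float)
    ones = np.ones((pts.shape[0], 1))
    # Lift to homogeneous coordinates and apply H to each row vector.
    homog = np.hstack([pts, ones]) @ np.asarray(H, dtype=float).T
    # Divide by the third coordinate to return to the 2D plane.
    return homog[:, :2] / homog[:, 2:3]

# Hypothetical homography (an assumption, not taken from the paper):
# the bottom-left entry makes the scale depend on x, a mild perspective tilt.
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.1, 0.0, 1.0]])

# Corners of a unit square standing in for a feature-map grid.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
warped = apply_homography(H, corners)
```

Points with larger x are pulled toward the origin while points at x = 0 stay fixed, which is the foreshortening effect a change of viewpoint produces on a planar grid.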

Paper Structure (30 sections, 12 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: The comparison of perspective-oriented learning methods: Perspective Augmentation, Multi-View Data Training, and the proposed Discrete Latent Perspective Learning.
  • Figure 2: The variation in image perspectives in three typical scenarios: auto-driving, UAV flight, and daily photos. UAV images exhibit more diverse and dynamic viewpoints due to variations in flight heights, angles, and other factors. However, these datasets lack multi-view images for a single scene, preventing existing networks from effective perspective-invariant learning of the images.
  • Figure 3: The architecture of the DLPL framework within a Vision Transformer (ViT) setup, which is divided into an encoder, comprising a Visual Feature Extraction module and four DLPL blocks, and a task-specific decoder. The DLPL block includes: (1) a Perspective Discrete Decomposition (PDD) module that decomposes input visual features $\mathbf{I}$ into a latent representation $\mathbf{P}$; (2) a Visual Reconstruction (VR) module that reconstructs visual features from $\mathbf{P}$ with a loss function $\mathcal{L}_{rec}$; (3) a Perspective Embedding Space $\mathcal{P}$ constructed through EMA, based on $\mathbf{P}$; (4) Perspective Transformation, which leverages homography for perspective variation, followed by VR to obtain reconstructed visual features $\mathbf{I}'_{rec}$ of the transformed perspective $\mathbf{P}'$. Finally, Perspective-Invariant Learning fuses the features of the original and transformed perspectives using Perspective-Invariant Attention (PIA) to learn semantic information invariant to perspective changes.
  • Figure 4: The illustration of PDD. It takes visual features $\mathbf{I}$ as input and processes them in three key steps: (1) Extraction of perspective prototypes from a point-ness map $\mathbf{S}$ using spatially structured supporting points; (2) Discretization of the continuous perspective distribution into $M$ levels within the latent feature space; (3) Construction of a statistical perspective graph $\mathbf{G}$ by encoding the intensity and positional relationships of discretized levels. This process yields the latent perspective feature map $\mathbf{P}$.
  • Figure 5: The illustration of Visual Reconstruction (VR). It is essentially the inverse of the PDD process: it reconstructs the visual features from the perspective representation.
  • ...and 3 more figures
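
The captions for Figures 3 and 4 mention discretization into $M$ levels and an embedding space built through EMA (exponential moving average). The sketch below shows one generic way these two ideas can fit together, assuming uniform binning of scores in [0, 1] and a VQ-VAE-style moving-average update; both choices, along with the names `discretize` and `ema_update` and the `decay` value, are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def discretize(values, num_levels):
    """Quantize values in [0, 1] into integer level indices 0..num_levels-1."""
    v = np.clip(np.asarray(values, dtype=float), 0.0, 1.0)
    # A value of exactly 1.0 would land in bin num_levels, so cap the index.
    return np.minimum((v * num_levels).astype(int), num_levels - 1)

def ema_update(embeddings, features, levels, decay=0.99):
    """Move each level's embedding toward the mean of the features assigned to it."""
    updated = embeddings.copy()
    for m in range(embeddings.shape[0]):
        mask = levels == m
        if mask.any():  # levels with no assigned features are left unchanged
            batch_mean = features[mask].mean(axis=0)
            updated[m] = decay * embeddings[m] + (1.0 - decay) * batch_mean
    return updated

# Hypothetical inputs: five scores, M = 4 levels, 3-dimensional features.
scores = np.array([0.05, 0.30, 0.55, 0.99, 1.0])
levels = discretize(scores, num_levels=4)
features = np.ones((5, 3))
embeddings = np.zeros((4, 3))
new_embeddings = ema_update(embeddings, features, levels, decay=0.9)
```

A moving-average update of this kind changes each embedding slowly, so the embedding space stays stable across batches instead of following any single batch.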