Vision-Language Modeling with Regularized Spatial Transformer Networks for All Weather Crosswind Landing of Aircraft

Debabrata Pal; Anvita Singh; Saumya Saumya; Shouvik Das

TL;DR

This work proposes to synthesize harsh weather landing images by training a prompt-based climatic diffusion network, and optimize a weather distillation model using a novel diffusion-distillation loss to learn to clear weather-induced visual degradations.

Abstract

The intrinsic capability of the Human Vision System (HVS) to perceive depth of field, together with possible failure of Instrument Landing Systems (ILS), motivates a pilot to perform a vision-based manual landing rather than an autoland approach. However, harsh weather creates challenges, and a pilot must have a clear view of runway elements before the minimum decision altitude. To aid manual landing, a vision-based system trained to clear weather-induced visual degradations requires a robust landing dataset covering various climatic conditions. Acquiring such a dataset by flying an aircraft in dangerous weather, however, compromises safety. Moreover, such a system fails to generate reliable warnings, because localization of runway elements suffers from projective distortion during crosswind landing. To combat these issues, we propose to synthesize harsh weather landing images by training a prompt-based climatic diffusion network. We also optimize a weather distillation model using a novel diffusion-distillation loss so that it learns to clear these visual degradations. Precisely, the distillation model learns an inverse relationship with the diffusion network. At inference time, the pre-trained distillation network directly clears weather-impacted onboard camera images, which can then be projected to display devices for improved visibility. Then, to tackle crosswind landing, a novel Regularized Spatial Transformer Networks (RuSTaN) module accurately warps landing images. It minimizes the localization error of the runway object detector and helps generate reliable internal software warnings. Finally, we curated an aircraft landing dataset (AIRLAD) by simulating landing scenarios under various weather degradations and experimentally validated our contributions.

Paper Structure (13 sections, 5 equations, 7 figures, 3 tables)


Introduction
Related Work
Proposed Method
Problem Statement
Review of the Spatial Transformer Networks
Novel Regularized Spatial Transformer Networks
Proposed framework
Experiments and Analysis
Dataset overview
Evaluation metrics
Experimental results
Ablation analysis
Conclusion

Figures (7)

Figure 1: Row-1) During training, we generate harsh weather landing images using a climatic diffusion network, and a distillation module learns to remove those visual degradations. During real-time onboard inference in the cockpit, the distillation module directly generates clear weather landing images to display to a pilot. Row-2) When the aircraft rolls during a crosswind landing, the RuSTaN module predicts accurate affine parameters to warp an image so that its vertical axis is parallel to the runway. This helps predict bounding boxes accurately and avoids missed or false alerts in warning generation.
Figure 2: a) In STN, a Localization Net predicts the transformation parameter $\theta$, and an input image or a feature map is warped based on $\mathcal{T}_{\theta}(G)$. Because of vanishing or exploding gradients in deep networks and optimization for the overall model objective, $\theta$ can be inaccurate. Hence, in b), we regularize the Localization Net to predict the transformation parameter $\theta_p$ as $\theta_{GT}^{-1}$ during training, where $\theta_{GT}$ is obtained by a novel affine sampler. During inference, RuSTaN acts as an STN in which the Localization Net directly processes the real-time projectively distorted image $U$ to predict an accurate $\theta$, based on self-supervised pre-training with the inversion constraint.
Figure 3: Illustration of the proposed framework. During training, diverse harsh climatic conditions are synthesized using a VLM or climatic diffusion model, and a generative weather distillation module learns to remove the synthesized weather artifacts. In addition, to learn to recognize runway elements during crosswind landing, the affine sampler in the RuSTaN module geometrically distorts an input image, and the Localization Net that follows learns to undo the projective transformation. Finally, an object detector localizes essential runway elements accurately to generate reliable alerts and aid the pilot in a vision-based landing. The inference-time data flow is highlighted in red; it omits the VLM and the affine sampler, since weather degradation and projective distortion at crosswind are inherently present in the real-time camera data.
Figure 4: Sample images from the AIRLAD dataset, generated using the Lockheed Martin Prepar3D simulator. Row-1) Landing on a clear day with 3° attitude, and in rain, fog, and fog with rain, respectively. Row-2) Different instances of landing from decision height to touchdown, featuring realistic generation of weather artifacts with tire skid marks and a crossing taxiway.
Figure 6: Comparative analysis of restoration from projective distortion after rotating the test image by $10^\circ$.
...and 2 more figures
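
The inversion constraint described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the affine sampler draws a pure in-plane rotation and that the regularizer is a mean squared error between $\theta_p$ and $\theta_{GT}^{-1}$; neither detail is stated in this section, and the function names `rotation_affine`, `invert_affine`, and `inversion_loss` are hypothetical.

```python
import numpy as np

def rotation_affine(angle_deg):
    """Return a 2x3 affine matrix for an in-plane rotation (hypothetical sampler output)."""
    a = np.deg2rad(angle_deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0]])

def invert_affine(theta):
    """Invert a 2x3 affine by lifting it to a 3x3 homogeneous matrix."""
    full = np.vstack([theta, [0.0, 0.0, 1.0]])
    return np.linalg.inv(full)[:2, :]

def inversion_loss(theta_p, theta_gt):
    """Assumed squared-error penalty between theta_p and the inverse of theta_gt."""
    return float(np.mean((theta_p - invert_affine(theta_gt)) ** 2))

theta_gt = rotation_affine(10.0)       # distortion applied by the sampler
theta_ideal = rotation_affine(-10.0)   # a perfect prediction undoes that distortion
loss_ideal = inversion_loss(theta_ideal, theta_gt)
loss_identity = inversion_loss(rotation_affine(0.0), theta_gt)
```

With a 10° sampled rotation, a prediction equal to the opposite rotation gives a loss near zero, whereas predicting the identity transform leaves a positive penalty, which is the signal such a regularizer would feed back to the Localization Net.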
