Invisible Servoing: a Visual Servoing Approach with Return-Conditioned Latent Diffusion
Bishoy Gerges, Barbara Bazzana, Nicolò Botteghi, Youssef Aboudorra, Antonio Franchi
TL;DR
This work addresses UAV visual servoing (VS) under target invisibility, where conventional VS methods fail because the target is occluded or out of view. It proposes a latent-diffusion framework that operates in a compact latent space learned by a Cross-Modal Variational Autoencoder (CM-VAE) and uses return-conditioned latent DDPMs to generate trajectories toward a target view; a dedicated return-estimation heuristic ties planning to feasible, smooth motion. The approach is validated in Gazebo simulations with a quadrotor and a hexarotor, demonstrating recovery of the target view and successful visuospatial alignment despite initial invisibility, and showing improved tracking in closed-loop receding-horizon experiments. The combination of CM-VAE, latent DDPM planning, and return-conditioned control offers a robust alternative to feature-based VS, with potential for end-to-end real-world deployment and MPC integration.
Abstract
In this paper, we present a novel visual servoing (VS) approach based on latent Denoising Diffusion Probabilistic Models (DDPMs), which explores the application of generative models to vision-based navigation of Uncrewed Aerial Vehicles (UAVs). Unlike classical VS methods, the proposed approach can reach the desired target view even when the target is initially not visible. This is made possible by two elements: a learned latent representation that the DDPM uses for planning, and a dataset of trajectories that includes initial views in which the target is not visible. The compact representation is learned from raw images using a Cross-Modal Variational Autoencoder. Given the current image, the DDPM generates trajectories in the latent space that drive the robotic platform to the desired visual target. The approach has been validated in simulation using two generic multi-rotor UAVs (a quadrotor and a hexarotor). The results show that the visual target is reached successfully, even when it is not visible in the initial view.
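To make the planning step concrete, the following is a minimal sketch of standard DDPM ancestral sampling applied to a latent trajectory, and is not the authors' implementation. Everything named here is an assumption for illustration: the linear noise schedule, the function names, the `target_return` argument, and in particular the placeholder `dummy_eps_model`, which stands in for a trained return-conditioned noise-prediction network. Clamping the first trajectory element to the current latent is a conditioning trick borrowed from diffusion-based planners in general; the abstract does not state that it is used here.

```python
import numpy as np


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumption; other schedules are common)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def dummy_eps_model(x, t, target_return):
    """Hypothetical stand-in for a trained noise predictor eps(x, t, return).

    A real model would be a neural network; this one predicts zero noise.
    """
    return np.zeros_like(x)


def sample_latent_trajectory(z_current, horizon, eps_model,
                             target_return=1.0, num_steps=50, seed=0):
    """Sample a (horizon, latent_dim) trajectory by reverse diffusion."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)

    # Start from pure Gaussian noise over the whole trajectory.
    x = rng.standard_normal((horizon, z_current.shape[0]))
    x[0] = z_current  # assumed conditioning: pin the first latent state

    for t in range(num_steps - 1, -1, -1):
        eps = eps_model(x, t, target_return)
        # Posterior mean of the standard DDPM reverse step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added at the final step (t == 0).
        noise = rng.standard_normal(x.shape) if t > 0 else np.zeros_like(x)
        x = mean + np.sqrt(betas[t]) * noise
        x[0] = z_current  # re-apply the constraint after every step
    return x


if __name__ == "__main__":
    z0 = np.ones(4)  # hypothetical 4-dimensional latent of the current image
    plan = sample_latent_trajectory(z0, horizon=8, eps_model=dummy_eps_model)
    print(plan.shape)
```

In a receding-horizon loop, one would presumably execute only the first part of the sampled plan, encode the new image into a fresh latent, and sample again.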
