Active Human Pose Estimation via an Autonomous UAV Agent

Jingxi Chen, Botao He, Chahat Deep Singh, Cornelia Fermuller, Yiannis Aloimonos

TL;DR

The paper tackles occlusion-driven challenges in 2D human pose estimation from UAV videos by proposing an integrated, autonomous system. It combines NeRF-based drone-view data generation, an on-board PoseErrNet that estimates a 3D perception guidance field from 2D pose observations, and a perception-aware planner that fuses this guidance with UAV dynamics to select feasible camera viewpoints. Key contributions include a drone-view data generation framework, an efficient on-board network for next-view estimation, and a combined planner that ensures perception quality while respecting safety constraints. The approach demonstrates improved pose estimation accuracy and safe navigation in both simulated and real-world scenarios, with potential impact on aerial cinematography and surveillance tasks.

Abstract

One of the core activities of an active observer involves moving to secure a "better" view of the scene, where the definition of "better" is task-dependent. This paper focuses on the task of human pose estimation from videos capturing a person's activity. Self-occlusions within the scene can complicate or even prevent accurate human pose estimation. To address this, the camera must be relocated to a new vantage point that clarifies the view, thereby improving 2D human pose estimation. This paper formalizes the process of achieving an improved viewpoint. Our proposed solution to this challenge comprises three main components: a NeRF-based Drone-View Data Generation Framework, an On-Drone Network for Camera View Error Estimation, and a Combined Planner for devising a feasible motion plan to reposition the camera based on the predicted errors for camera views. The Data Generation Framework uses NeRF-based methods to generate a comprehensive dataset of human poses and activities, enhancing the drone's adaptability in various scenarios. The Camera View Error Estimation Network evaluates the current human pose and identifies the most promising next viewing angles for the drone, so that pose estimation from those angles is reliable and precise. Finally, the combined planner incorporates these angles while considering the drone's physical and environmental limitations, and employs efficient algorithms to plan safe and effective flight paths. This system represents a significant advancement in active 2D human pose estimation for an autonomous UAV agent, offering substantial potential for applications in aerial cinematography by improving the performance of autonomous human pose estimation and maintaining the operational safety and efficiency of UAVs.
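
As a rough illustration of the planning step described above, the sketch below picks a next viewpoint from a discrete set of candidates by adding a predicted pose-error score to a travel cost and skipping unsafe candidates. The function name `select_next_view`, the weight `w_perception`, the boolean `safe_mask`, and the straight-line travel cost are all assumptions made for illustration. The abstract does not specify the planner's actual cost terms, so this is not the authors' method.

```python
import math


def select_next_view(candidates, predicted_error, current_pos, safe_mask,
                     w_perception=1.0):
    """Return the index of the cheapest safe candidate viewpoint.

    candidates      -- list of (x, y, z) camera positions (hypothetical discretization)
    predicted_error -- predicted pose-estimation error for each candidate
    current_pos     -- current (x, y, z) position of the drone
    safe_mask       -- True where a candidate is considered collision-free
    w_perception    -- assumed weight trading perception quality against travel
    """
    best_index, best_cost = None, math.inf
    for i, (pos, err, safe) in enumerate(
            zip(candidates, predicted_error, safe_mask)):
        if not safe:
            continue  # unsafe viewpoints are never selected
        # Assumed cost: weighted predicted error plus straight-line distance.
        cost = w_perception * err + math.dist(pos, current_pos)
        if cost < best_cost:
            best_index, best_cost = i, cost
    if best_index is None:
        raise ValueError("no safe candidate viewpoint available")
    return best_index
```

With a small `w_perception` the drone tends to stay near its current position. A larger weight makes it accept a longer flight to reach a view with lower predicted error, which is the trade-off a perception-aware planner has to balance.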

Paper Structure

This paper contains 14 sections, 3 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 0: Our proposed approach features an integrated system with three key components: 1) Drone-View Data Synthesis, which generates realistic drone perspectives of human subjects from various camera angles and human poses, alongside calculating the associated human pose estimation error for these views to serve as training data pairs. 2) PoseErrNet, a network trained on the generated drone-view data pairs, is capable of predicting a 3D perception guidance field for the selection of candidate viewing angles. 3) A comprehensive planner that integrates traditional navigation cost maps with the 3D perception guidance field derived from PoseErrNet. This integration enables effective motion planning, collision avoidance, and the execution of the next-best viewing angle selection for accurate human pose estimation.
  • Figure 1: The process for generating NeRF-based drone-view images of human subjects and 3D perception guidance field data. 2D annotations are used for batch triangulation, which yields a 3D skeleton for a given human pose. We then render the synthesized images for the "drone views", reproject the ground-truth 3D skeleton onto the NeRF poses to obtain ground-truth 2D keypoints, and employ an arbitrary human pose estimation (HPE) network to predict these keypoints, from which the per-camera-view HPE error is computed. This yields paired data comprising 2D observations and the corresponding 3D perception guidance field.
  • Figure 2: PoseErrNet consists of two major parts: 1) input abstraction and normalization, which addresses the sim-to-real gap and provides scale, translation, and rotation invariance for drone applications, and 2) an auto-encoder network that maps normalized 2D observations to 3D perception guidance fields.
  • Figure 3: Visualization for the calculation of P-ESDF.
  • Figure 4: The testing environment for the proposed motion planning framework.
  • ...and 3 more figures
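
The per-camera-view error described in the Figure 1 caption can be sketched as follows: project the ground-truth 3D skeleton into a camera to get ground-truth 2D keypoints, then compare them with the keypoints an HPE network predicts for that view. This is a minimal sketch under stated assumptions: a plain pinhole camera without distortion, a world-to-camera rotation `R` and translation `t`, and mean pixel distance as the error metric. The helper names `project_points` and `mean_keypoint_error` are hypothetical, and the paper's actual metric and camera model may differ.

```python
import math


def project_points(points_3d, R, t, fx, fy, cx, cy):
    """Project world-frame 3D points to pixel coordinates (pinhole model).

    R is a 3x3 world-to-camera rotation given as nested lists, t a 3-vector.
    fx, fy are focal lengths in pixels and (cx, cy) is the principal point.
    """
    pixels = []
    for p in points_3d:
        # Transform the point into the camera frame: x_cam = R @ p + t.
        xc = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
        if xc[2] <= 0:
            raise ValueError("point is behind the camera")
        # Perspective division followed by the intrinsic mapping.
        pixels.append((fx * xc[0] / xc[2] + cx, fy * xc[1] / xc[2] + cy))
    return pixels


def mean_keypoint_error(gt_2d, pred_2d):
    """Mean Euclidean pixel distance between matched 2D keypoints."""
    if len(gt_2d) != len(pred_2d) or not gt_2d:
        raise ValueError("keypoint lists must be non-empty and the same length")
    return sum(math.dist(g, p) for g, p in zip(gt_2d, pred_2d)) / len(gt_2d)
```

Evaluating `mean_keypoint_error` once per rendered camera pose gives one scalar per view. Collected over all views around the subject, those scalars form the kind of per-view error data that the caption pairs with the 2D observations.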