Certifying Robustness of Learning-Based Keypoint Detection and Pose Estimation Methods

Xusheng Luo; Tianhao Wei; Simin Liu; Ziwei Wang; Luis Mattei-Mendez; Taylor Loper; Joshua Neighbor; Casidhe Hutchison; Changliu Liu

TL;DR

This work tackles the problem of certifying the local robustness of a vision-based, two-stage $6D$ pose estimation pipeline that regresses keypoints and solves a PnP problem. The authors convert the robustness certification into a standard neural-network verification task by building a verification-friendly proxy model, modeling the input as a convex hull of semantically perturbed images, and deriving output constraints via a sensitivity analysis of the PnP solver. They prove soundness and completeness under certain conditions and validate the approach on realistic perturbations, including occlusions and global brightness/contrast changes, using ModelVerification.jl on challenging airplane-pose scenarios. The framework enables system-level robustness guarantees for safety-critical applications and provides avenues to extend perturbation sets and reduce conservatism in threshold allocation, with broad potential impact on aviation, autonomous driving, and surgical robotics.

Abstract

This work addresses the certification of the local robustness of vision-based two-stage 6D object pose estimation. The two-stage method for object pose estimation achieves superior accuracy by first employing deep neural network-driven keypoint regression and then applying a Perspective-n-Point (PnP) technique. Despite these advancements, work on certifying the robustness of such methods remains scarce. This research aims to fill this gap with a focus on local robustness at the system level, that is, the capacity to maintain robust estimations amidst semantic input perturbations. The core idea is to transform the certification of local robustness into neural network verification for classification tasks. The challenge is to develop model, input, and output specifications that align with off-the-shelf verification tools. To facilitate verification, we modify the keypoint detection model by substituting nonlinear operations with ones more amenable to verification. Instead of injecting random noise into images, as is common, we employ a convex hull representation of images as the input specification to depict semantic perturbations more accurately. Furthermore, by conducting a sensitivity analysis, we propagate the robustness criteria from pose accuracy to keypoint accuracy, and then formulate an optimal error threshold allocation problem that allows maximally permissible keypoint deviation thresholds to be set. Viewing each pixel as an individual class, these thresholds result in linear, classification-like output specifications. Under certain conditions, we demonstrate that the main components of our certification framework are both sound and complete, and we validate its effectiveness through extensive evaluations on realistic perturbations. To our knowledge, this is the first study to certify the robustness of large-scale, keypoint-based pose estimation given images in real-world scenarios.
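The convex hull input specification mentioned in the abstract can be illustrated with a minimal sketch. It assumes images are NumPy arrays with intensities in [0, 1]; the function name `convex_combination`, the 4x4 toy image, and the brightness shift of 0.2 are hypothetical choices for illustration and are not taken from the paper. Any image in the hull is a weighted average of the seed and perturbed images, with non-negative weights that sum to one.

```python
import numpy as np

def convex_combination(images, weights):
    """Return sum_i weights[i] * images[i], a point in the convex hull of `images`."""
    images = np.asarray(images, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)
    if weights.shape != (images.shape[0],):
        raise ValueError("need exactly one weight per image")
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    # Contract the weight axis against the image-index axis.
    return np.tensordot(weights, images, axes=1)

# Hypothetical 4x4 grayscale seed image and a globally brightened copy.
seed = np.full((4, 4), 0.4)
brightened = np.clip(seed + 0.2, 0.0, 1.0)

# Equal weights give the image halfway between the two hull vertices.
midpoint = convex_combination([seed, brightened], [0.5, 0.5])
```

In this toy case the hull of the seed and the brightened image contains every intermediate brightness level, which is how a finite set of perturbed images can stand in for a continuous family of semantic perturbations.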

Paper Structure (47 sections, 3 theorems, 19 equations, 10 figures, 5 tables)

Introduction
Overview of the Approach
Model modification for verification.
Input specification through convex hulls.
Output specification via sensitivity analysis.
Contributions
Related Work
Formal Verification of Neural Networks
Certification of Keypoint Detection and Pose Estimation Methods
Background
Keypoint-based Pose Estimation
Verification of Neural Networks
Problem Formulation
Formulation of Verification of the Neural Network
Verification-Friendly Keypoint Detection Model
...and 32 more sections

Key Result

theorem 1

If Assumption asmp:gaussian holds, the proxy model is sound, that is,

Figures (10)

Figure 1: Overview of the PnP-based pose estimation and the proposed verification framework. A thick red dashed line divides the sections for pose estimation (above) and verification (below). For pose estimation, a seed image ${\mathbf X}_0$ is processed by the target model ${\mathbf F}_{\text{target}}$ to identify keypoints, which are then input into the PnP method ${\mathbf G}$ to determine the pose ${\mathbf R}$ and ${\mathbf t}$. The verification framework takes as input the seed image ${\mathbf X}_0$ and a set of perturbed images ${\mathbf X}$ that form the convex hull ${\mathcal{X}}$, along with the pose error bound $\bm{\epsilon}$. Through sensitivity analysis, this pose error bound is transformed into a keypoint error bound, which in turn determines the parameters of the average pooling operation. This substitution replaces the less verification-friendly softmax operation, creating the proxy model ${\mathbf F}_{\text{proxy}}$. By checking the inclusion relation between the reachable set of model ${\mathbf F}_{\text{proxy}}$ and the output specification, the verification tool returns whether the model is robust.
Figure 2: Pose estimation of an airplane parked at airports is conducted using a PnP-based method. The method uses 23 keypoints, marked in red, which are placed across the airplane's surface to thoroughly cover the aircraft's body, as shown in the 3D model from hikami3150_2024 (right). These keypoints have predefined 3D coordinates within the airplane's coordinate system. An overhead image of the airplane is taken and 2D keypoints, marked in green, are identified through a keypoint detection network. The PnP-based method computes the transformation matrix between the plane and camera coordinate frames.
Figure 3: (a) Graphical depiction of determining the stride parameter. The unnormalized heatmap is overlaid on the airplane. The most saturated area in the heatmap, located near the center, indicates its peak. Consecutive average pooling patches are in different colors, denoted as ${\mathcal{P}}_{-1}, {\mathcal{P}}$, and ${\mathcal{P}}_{+1}$, with dots indicating their centers. The red star denotes the ground-truth keypoint, which aligns with the center of the average pooling patch ${\mathcal{P}}$. The dashed vertical lines, $b_{-1}$ and $b_{+1}$, represent perpendicular bisectors between the centers of adjacent patches, with a distance equal to the stride $s_h$. Consequently, the red pooling patch corresponds to the predicted keypoint. (b) and (c): Determination of the padding parameter.
Figure 4: Illustration of soundness and completeness of the proxy model: The left heatmap, sized $12\times12$ and generated by the target model, is transformed into a $4\times4$ heatmap by the proxy model, utilizing pooling patches with both kernel and stride set to 3. Both the averaged ground-truth keypoint $\bar{{\mathbf v}}_k$ and the ground-truth keypoint ${\mathbf v}_k$ are marked in green. The area highlighted in blue encompasses pixels located within a $\delta {\mathbf v}_k^*$ distance from the ground truth.
Figure 5: Illustration of the probabilistic soundness and completeness. On the left, the green area represents the ground-truth tolerable errors on keypoints $\delta {\mathcal{V}}_{\Xi}$. The red patches (${\mathcal{H}} {\mathcal{R}} \setminus \delta {\mathcal{V}}_{\Xi}$) indicate the keypoint errors that result in pose errors exceeding tolerance. The brown area (${\mathcal{H}} {\mathcal{R}} \cap \delta {\mathcal{V}}_{\Xi}$) shows the keypoint errors that lead to tolerable pose errors. Similarly, on the right, the brown area ($\Xi \cap \Xi_{{\mathcal{H}}{\mathcal{R}}}$) represents the pose errors that cause keypoint errors within ${\mathcal{H}}{\mathcal{R}}$. The dashed black lines indicate the correlations between these two spaces.
...and 5 more figures
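
The pooling step described in the Figure 4 caption (a $12\times12$ heatmap reduced to $4\times4$ with kernel and stride both set to 3) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes no padding and a fixed kernel, whereas the paper derives the pooling parameters from the keypoint error bound. The helper name `average_pool` and the peak location are hypothetical.

```python
import numpy as np

def average_pool(heatmap, kernel):
    """Non-overlapping average pooling (stride equals kernel, no padding)."""
    h, w = heatmap.shape
    if h % kernel or w % kernel:
        raise ValueError("heatmap size must be divisible by kernel")
    # Split each axis into (number of patches, kernel) and average inside patches.
    blocks = heatmap.reshape(h // kernel, kernel, w // kernel, kernel)
    return blocks.mean(axis=(1, 3))

# Hypothetical 12x12 heatmap with a single peak at row 7, column 4.
heatmap = np.zeros((12, 12))
heatmap[7, 4] = 1.0

pooled = average_pool(heatmap, kernel=3)  # shape (4, 4)

# Treat each pooled cell as a class: the predicted patch is the argmax cell.
patch = np.unravel_index(np.argmax(pooled), pooled.shape)
```

Here the peak pixel falls in the patch covering rows 6 to 8 and columns 3 to 5, so the argmax selects patch index (2, 1). Reading each pooled cell as a class is what lets a keypoint tolerance be expressed as a classification-style output specification.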

Theorems & Definitions (9)

definition 1: Convex hull of images
Remark 1
definition 2: Soundness
definition 3: Completeness
theorem 1
proposition 1
theorem 2
definition 4: Probabilistic Soundness
definition 5: Probabilistic Completeness
