EdgePoint2: Compact Descriptors for Superior Efficiency and Accuracy

Haodi Yao, Fenghua He, Ning Hao, Chen Xie

TL;DR

EdgePoint2 tackles the challenge of real-time, accurate keypoint detection and description on resource-constrained edge devices by separating a compact feature encoder from detection and description heads and, crucially, by a descriptor-distillation framework that preserves embedding structure in low dimensions. The core novelty lies in combining Orthogonal Procrustes loss with a similarity loss to distill teacher descriptors into compact student descriptors across dimensions, enabling 32/48/64-dimensional representations without sacrificing SOTA performance. The authors offer 14 sub-models, validate across multiple benchmarks (HPatches, MegaDepth/ScanNet, IMC2022, Aachen/InLoc), and demonstrate strong efficiency and robustness on both GPU-enabled edge accelerators and CPU-only devices, including real-time inference on ARM. This work advances practical deployment of dense, reliable keypoint pipelines in distributed vision systems by reducing bandwidth and computation while maintaining high matching and localization accuracy.

Abstract

The field of keypoint extraction, which is essential for vision applications like Structure from Motion (SfM) and Simultaneous Localization and Mapping (SLAM), has evolved from relying on handcrafted methods to leveraging deep learning techniques. While deep learning approaches have significantly improved performance, they often incur substantial computational costs, limiting their deployment in real-time edge applications. Efforts to create lightweight neural networks have seen some success, yet they often result in trade-offs between efficiency and accuracy. Additionally, the high-dimensional descriptors generated by these networks pose challenges for distributed applications requiring efficient communication and coordination, highlighting the need for compact yet competitively accurate descriptors. In this paper, we present EdgePoint2, a series of lightweight keypoint detection and description neural networks specifically tailored for edge computing applications on embedded systems. The network architecture is optimized for efficiency without sacrificing accuracy. To train compact descriptors, we introduce a combination of Orthogonal Procrustes loss and similarity loss, which can serve as a general approach for hypersphere embedding distillation tasks. Additionally, we offer 14 sub-models to satisfy diverse application requirements. Our experiments demonstrate that EdgePoint2 consistently achieves state-of-the-art (SOTA) accuracy and efficiency across various challenging scenarios while employing lower-dimensional descriptors (32/48/64). Beyond its accuracy, EdgePoint2 offers significant advantages in flexibility, robustness, and versatility. Consequently, EdgePoint2 emerges as a highly competitive option for visual tasks, especially in contexts demanding adaptability to diverse computational and communication constraints.
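
The abstract names the two distillation losses without giving their formulas. As a rough illustration only, the sketch below uses the classical closed-form Orthogonal Procrustes solution (via SVD) to rotate student descriptors onto already dimension-reduced teacher descriptors and then measures the mean cosine distance; the similarity term is written as the mean cosine distance between corresponding descriptors from two views. The function names, the synthetic data, and the exact loss forms are assumptions made for this example, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Project each row onto the unit hypersphere."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def procrustes_rotation(d_s, d_l):
    """Orthogonal matrix Omega minimizing ||d_s @ Omega - d_l||_F.

    Classical solution: SVD of d_s^T d_l = U S V^T, then Omega = U V^T.
    """
    u, _, vt = np.linalg.svd(d_s.T @ d_l)
    return u @ vt


def procrustes_loss(d_s, d_l):
    """Mean cosine distance after optimally rotating the student descriptors.

    Assumes unit-norm rows, so the row-wise dot product is the cosine similarity.
    """
    aligned = d_s @ procrustes_rotation(d_s, d_l)
    return float(np.mean(1.0 - np.sum(aligned * d_l, axis=1)))


def similarity_loss(d_a, d_b):
    """Mean cosine distance between corresponding unit-norm descriptors."""
    return float(np.mean(1.0 - np.sum(d_a * d_b, axis=1)))


# Synthetic check: the "student" is the reduced "teacher" under an unknown rotation.
rng = np.random.default_rng(0)
d_l = l2_normalize(rng.normal(size=(128, 32)))      # hypothetical reduced teacher descriptors
q, _ = np.linalg.qr(rng.normal(size=(32, 32)))      # random orthogonal matrix
d_s = d_l @ q.T                                     # hypothetical student descriptors

omega = procrustes_rotation(d_s, d_l)
aligned_loss = procrustes_loss(d_s, d_l)            # close to zero after alignment
unaligned_loss = similarity_loss(d_s, d_l)          # large without alignment
print(aligned_loss, unaligned_loss)
```

The point of the rotation step is that the student is not forced to reproduce the teacher's coordinate axes, only the relative arrangement of descriptors on the hypersphere; in this toy case the alignment recovers the hidden rotation and the loss vanishes.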

Paper Structure

This paper contains 22 sections, 8 equations, 6 figures, 7 tables, 2 algorithms.

Figures (6)

  • Figure 1: Comparison of local descriptor distillation approaches. The dashed and solid boxes denote descriptors generated by the teacher and student models, respectively. Specifically, (a) illustrates vanilla direct distillation between feature descriptors; (b) optimizes distribution matching via distance matrix loss computation; (c) shows the Orthogonal Procrustes-based alignment with PCA; (d) presents the proposed method that combines Orthogonal Procrustes loss (incorporating LRA) and similarity loss through image augmentation, resulting in SOTA accuracy.
  • Figure 2: EdgePoint2 model architecture. The feature encoder generates multi-scale features, as illustrated in the yellow block. These pyramid features are aggregated using various sizes and operations, as shown in the green block. The detection head and the description head, shown in the blue block, utilize feature maps of sizes $\frac{H}{2} \times \frac{W}{2}$ and $\frac{H}{4} \times \frac{W}{4}$ to enhance accuracy and efficiency. During post-processing, keypoints are extracted from the detection map using non-maximum suppression (NMS), while the description map is sampled bilinearly. The notation $C_{i}, i\in\{1,2,3,4\}$ denotes the number of output channels for each block. $C_{\mathrm{agg}}$ and $C_{\mathrm{det}}$ denote the number of feature map channels for description and detection, respectively, and $C_{\mathrm{desc}}$ is the descriptor dimension. The symbols $(\mathbf{C})$ and $(\boldsymbol{+})$ represent the concatenation and addition operations, respectively.
  • Figure 3: Illustration of Orthogonal Procrustes loss and similarity loss. The dashed and solid lines represent the descriptors of the teacher and student, respectively. The red and blue arrows indicate descriptors sampled from different locations. For illustration, the teacher descriptors, denoted as $\boldsymbol{D}_\mathrm{t}$, are initially represented within a three-dimensional unit sphere and are subsequently compressed to two dimensions using LRA, referred to as $\boldsymbol{D}_\mathrm{l}$. Given the student descriptors $\boldsymbol{D}_{\mathrm{s},i}$ and $\boldsymbol{D}_{\mathrm{s},j}$, where $i, j \in \{1,2,\ldots,N\}, i \neq j$, the orthogonal matrices $\boldsymbol{\Omega}_i$ and $\boldsymbol{\Omega}_j$ can be determined through Eq. \ref{opp_solution}. Subsequently, we can compute the Orthogonal Procrustes loss, which quantifies the cosine distances between the designated descriptors. For the similarity loss, we compute the cosine distance for each descriptor within the mini set. The Orthogonal Procrustes loss emphasizes the relative position in embedding space between the descriptors extracted from the same image, while the similarity loss ensures that the corresponding descriptors from different images remain sufficiently close.
  • Figure 4: Runtime efficiency comparison. The x-axis represents FPS on GPU, the y-axis denotes the number of parameters in log scale, and the size of the circles indicates the relative computational cost.
  • Figure 5: Qualitative results on MegaDepth-1500. We choose DISK, XFeat, and the largest models of ALIKE, ALIKED, and AWDesc for comparison. For EdgePoint2, we visualize the results for all model sizes (T/S/M/L/E) and descriptor dimensions (32/48/64) to demonstrate its consistent performance.
  • ...and 1 more figure
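
The Figure 2 caption describes post-processing in two steps: keypoints are picked from the detection map with NMS, and descriptors are read from the lower-resolution description map by bilinear sampling. A minimal NumPy sketch of those two steps follows. The function names, the window radius, the score threshold, and the corner-aligned coordinate mapping are all hypothetical choices for illustration; the paper's actual settings are not given in this section.

```python
import numpy as np


def simple_nms(score, radius=2, threshold=0.1):
    """Return (x, y) of pixels that are the maximum of their
    (2*radius+1) x (2*radius+1) window and exceed the threshold."""
    score = np.asarray(score, dtype=float)
    h, w = score.shape
    padded = np.pad(score, radius, mode="constant", constant_values=-np.inf)
    local_max = np.full_like(score, -np.inf)
    # Sliding-window maximum built from shifted views of the padded map.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            local_max = np.maximum(local_max, padded[dy:dy + h, dx:dx + w])
    ys, xs = np.nonzero((score == local_max) & (score > threshold))
    return np.stack([xs, ys], axis=1)


def sample_descriptors(desc_map, keypoints, image_hw, eps=1e-12):
    """Bilinearly sample a (C, h, w) description map at image-space keypoints
    and L2-normalize the result. Assumes h >= 2 and w >= 2."""
    c, h, w = desc_map.shape
    img_h, img_w = image_hw
    # Map image pixel coordinates onto the coarser description grid.
    x = keypoints[:, 0] * (w - 1) / max(img_w - 1, 1)
    y = keypoints[:, 1] * (h - 1) / max(img_h - 1, 1)
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    ax, ay = x - x0, y - y0
    # Weighted sum of the four neighbouring cells; each term has shape (C, K).
    d = (desc_map[:, y0, x0] * (1 - ax) * (1 - ay)
         + desc_map[:, y0, x0 + 1] * ax * (1 - ay)
         + desc_map[:, y0 + 1, x0] * (1 - ax) * ay
         + desc_map[:, y0 + 1, x0 + 1] * ax * ay).T
    return d / np.maximum(np.linalg.norm(d, axis=1, keepdims=True), eps)


# Toy example: a 16x16 detection map with two peaks and one weaker neighbour.
score = np.zeros((16, 16))
score[4, 5] = 0.9
score[4, 6] = 0.5      # inside the window of the stronger peak, so suppressed
score[10, 12] = 0.8
kpts = simple_nms(score)

rng = np.random.default_rng(0)
desc_map = rng.normal(size=(8, 4, 4))   # hypothetical 8-channel map at 1/4 resolution
desc = sample_descriptors(desc_map, kpts, image_hw=(16, 16))
print(kpts, desc.shape)
```

Sampling the description map at keypoint locations, rather than upsampling it to full resolution first, keeps the per-image cost proportional to the number of keypoints.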