MamKPD: A Simple Mamba Baseline for Real-Time 2D Keypoint Detection

Yonghao Dang, Liyuan Liu, Hui Kang, Ping Ye, Jianqin Yin

TL;DR

This work introduces MamKPD, the first Mamba-based baseline for real-time 2D keypoint detection. It addresses the limited inter-patch context of conventional Mamba blocks with a lightweight Contextual Modeling Module (CMM) and a three-stage Mamba encoder. The architecture combines a stem for noise reduction, CMM-enhanced stages, and an SS2D module that aggregates patch information. It reaches high speed (e.g., up to 1492 FPS on a single RTX 4090) while maintaining competitive accuracy on COCO, MPII, and AP-10K, with substantial parameter savings compared with ViTPose. Ablation studies show the importance of the Stem and the CMM for multi-scale context and inter-patch communication, and qualitative results illustrate robust keypoint localization for both humans and animals. The approach offers a practical, efficient solution for real-time pose estimation with strong cross-domain performance and is released for open-source use.

Abstract

Real-time 2D keypoint detection plays an essential role in computer vision. Although CNN-based and Transformer-based methods have achieved breakthrough progress, they often fail to deliver both superior performance and real-time speed. This paper introduces MamKPD, the first efficient yet effective Mamba-based pose estimation framework for 2D keypoint detection. The conventional Mamba module exhibits limited information interaction between patches. To address this, we propose a lightweight contextual modeling module (CMM) that uses depth-wise convolutions to model inter-patch dependencies and linear layers to distill the pose cues within each patch. Combined with Mamba for global modeling across all patches, MamKPD effectively extracts the pose information of each instance. We conduct extensive experiments on human and animal pose estimation datasets to validate the effectiveness of MamKPD. Our MamKPD-L achieves 77.3% AP on the COCO dataset at 1492 FPS on an NVIDIA RTX 4090 GPU. Moreover, MamKPD achieves state-of-the-art results on the MPII dataset and competitive results on the AP-10K dataset while saving 85% of the parameters compared to ViTPose. Our project page is available at https://mamkpd.github.io/.
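
The two roles the abstract assigns to the CMM can be illustrated with a minimal NumPy sketch: a depth-wise convolution mixes each patch with its spatial neighbours, and linear layers then act on each patch's feature vector independently. This is not the authors' implementation. The 3x3 kernel size, the residual connections, the ReLU activation, the hidden width, and the names `depthwise_conv3x3` and `cmm_sketch` are all assumptions made for illustration, and the Mamba global-modeling step is omitted.

```python
import numpy as np


def depthwise_conv3x3(x, kernels):
    """Depth-wise 3x3 convolution with zero padding.

    x:       (H, W, C) grid of patch features
    kernels: (3, 3, C), one 3x3 filter per channel (no channel mixing)
    """
    H, W, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            # Shifted view of the grid, weighted per channel.
            out += padded[dy:dy + H, dx:dx + W, :] * kernels[dy, dx, :]
    return out


def cmm_sketch(x, kernels, w1, w2):
    """Hypothetical CMM-style block (assumed layout, not the paper's)."""
    # Inter-patch step: each patch sees its 8 spatial neighbours.
    x = x + depthwise_conv3x3(x, kernels)
    # Intra-patch step: linear layers applied to every patch separately.
    hidden = np.maximum(x @ w1, 0.0)  # ReLU is an assumption
    return x + hidden @ w2


rng = np.random.default_rng(0)
H, W, C = 8, 6, 16                      # assumed patch grid and channel width
x = rng.standard_normal((H, W, C))
kernels = rng.standard_normal((3, 3, C)) * 0.1
w1 = rng.standard_normal((C, 2 * C)) * 0.1
w2 = rng.standard_normal((2 * C, C)) * 0.1

y = cmm_sketch(x, kernels, w1, w2)
print(y.shape)
```

Because the convolution is depth-wise, it costs only 9 weights per channel here, which is in keeping with the "lightweight" description; channel mixing is left entirely to the linear layers.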

Paper Structure

This paper contains 21 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Comparisons of performance and parameters on COCO val2017 set. The circles' size represents the model's scale, i.e., the number of the model's parameters.
  • Figure 2: (a) Overview of the proposed MamKPD. (b) and (c) Structures of the Stem and the MamKPD stage.
  • Figure 3: Ablation studies on the Stem and CMM. "Base" represents the pure Mamba model. Different colors represent the performance improvements of other models relative to the baseline model.
  • Figure 4: Ablation studies on the CMM. The 1st and 3rd rows show the feature visualizations of MamKPD without the CMM. The 2nd and 4th rows display the feature visualizations of the proposed MamKPD.
  • Figure 5: Feature visualization for each stage of MamKPD.
  • ...and 2 more figures