MediaPipe Hands: On-device Real-time Hand Tracking

Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, Matthias Grundmann

TL;DR

This work presents a real-time, on-device hand tracking pipeline that predicts 2.5D hand landmarks from RGB input using a two-model approach: a BlazePalm palm detector and a hand landmark regressor, both implemented in MediaPipe. The system uses frame-to-frame propagation to reduce detector calls and is trained with a combination of real and synthetic datasets to improve accuracy and depth estimation. Key contributions include the mobile-optimized detector, a robust 21-landmark model with depth supervision, and an open-source MediaPipe implementation enabling cross-platform AR/gesture applications. The approach achieves real-time performance on commodity devices and supports multi-hand tracking with practical gating and synchronization mechanisms. Overall, MediaPipe Hands provides a practical, extensible solution for on-device hand tracking and interaction in AR/VR contexts.
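The frame-to-frame propagation described above can be illustrated with a minimal sketch. It is not the MediaPipe implementation: `HandTracker`, `LandmarkResult`, `box_from_landmarks`, the `presence` score, the 0.5 threshold, and the 0.1 margin are all hypothetical names and values chosen for illustration, and the two models are passed in as plain callables. The idea shown is only that the palm detector runs when there is no region of interest carried over from the previous frame.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized
Landmark = Tuple[float, float, float]    # (x, y, relative depth), i.e. 2.5D


@dataclass
class LandmarkResult:
    landmarks: List[Landmark]  # the model described here predicts 21 of these
    presence: float            # assumed confidence in [0, 1] that a hand is in the crop


def box_from_landmarks(landmarks: List[Landmark], margin: float = 0.1) -> Box:
    """Derive the next frame's crop from the current landmarks (assumed margin)."""
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)


class HandTracker:
    """Runs the detector only when no crop is available from the previous frame."""

    def __init__(
        self,
        detect_palm: Callable[[object], Optional[Box]],
        predict_landmarks: Callable[[object, Box], LandmarkResult],
        presence_threshold: float = 0.5,  # assumed value, not taken from the paper
    ) -> None:
        self.detect_palm = detect_palm
        self.predict_landmarks = predict_landmarks
        self.presence_threshold = presence_threshold
        self.roi: Optional[Box] = None
        self.detector_calls = 0

    def process(self, frame: object) -> Optional[LandmarkResult]:
        if self.roi is None:
            # No crop propagated from the previous frame: run the detector.
            self.detector_calls += 1
            self.roi = self.detect_palm(frame)
            if self.roi is None:
                return None
        result = self.predict_landmarks(frame, self.roi)
        if result.presence < self.presence_threshold:
            # Tracking lost: drop the crop so the detector runs on the next frame.
            self.roi = None
            return None
        # Tracking held: propagate a crop derived from the landmarks.
        self.roi = box_from_landmarks(result.landmarks)
        return result
```

With stub models that lose the hand on a single frame, the detector in this sketch runs once at the start and once after the loss, while the landmark model runs on every frame.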

Abstract

We present a real-time on-device hand tracking pipeline that predicts a hand skeleton from a single RGB camera for AR/VR applications. The pipeline consists of two models: 1) a palm detector and 2) a hand landmark model. It is implemented via MediaPipe, a framework for building cross-platform ML solutions. The proposed model and pipeline architecture demonstrate real-time inference speed on mobile GPUs and high prediction quality. MediaPipe Hands is open sourced at https://mediapipe.dev.

Paper Structure

This paper contains 9 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Rendered hand tracking result. (Left): Hand landmarks with relative depth presented in different shades. The lighter and larger the circle, the closer the landmark is towards the camera. (Right): Real-time multi-hand tracking on Pixel 3.
  • Figure 2: Palm detector model architecture.
  • Figure 3: Architecture of our hand landmark model. The model has three outputs sharing a feature extractor. Each head is trained on the corresponding datasets, marked in the same color. See the hand landmark model section for more detail.
  • Figure 4: Examples of our datasets. (Top): Annotated real-world images. (Bottom): Rendered synthetic hand images with ground truth annotation. See the dataset section for details.
  • Figure 5: The hand landmark model’s output controls when the hand detection model is triggered. This behavior is achieved by MediaPipe’s powerful synchronization building blocks, resulting in high performance and optimal throughput of the ML pipeline.
  • ...and 2 more figures