Unsupervised Skin Feature Tracking with Deep Neural Networks

Jose Chang, Torbjörn E. M. Nordling

TL;DR

This work tackles skin-feature tracking for remote heart-rate estimation and Parkinson's disease motor assessment by introducing Deep Feature Encodings (DFE), an unsupervised approach based on a convolutional autoencoder. DFE encodes $31\times31$ face/hand crops into $128$-dimensional latent representations and applies a Gaussian-weighted loss to suppress edge effects. It tracks accurately under significant motion, with mean errors in the $0.6$–$3.3$ pixel range, outperforming SIFT, SURF, Lucas-Kanade (LK), PIPs++, and CoTracker. The study validates DFE against these baselines on face mole and nose-tip features in both static and dynamic conditions, and demonstrates generalization to Parkinson's disease hand data, highlighting the method's data efficiency. Overall, the unsupervised, skin-specific descriptors enable reliable feature matching and registration in challenging imaging scenarios, with practical implications for imaging ballistocardiography and motor disorder assessment, particularly when labeled data are scarce.

Abstract

Facial feature tracking is essential in imaging ballistocardiography for accurate heart rate estimation, and skin feature tracking enables quantification of motor degradation in Parkinson's disease. While deep convolutional neural networks have shown remarkable accuracy in tracking tasks, they typically require extensive labeled data for supervised training. Our proposed pipeline employs a convolutional stacked autoencoder to match image crops with a reference crop containing the target feature, learning deep feature encodings specific to the object category in an unsupervised manner, thus reducing data requirements. To overcome edge effects that make performance depend on crop size, we introduced a Gaussian weight on the residual errors of the pixels when calculating the loss function. We trained the autoencoder on facial images and validated its performance on manually labeled face and hand videos. Our Deep Feature Encodings (DFE) method demonstrated superior tracking accuracy, with a mean error ranging from 0.6 to 3.3 pixels, outperforming traditional methods like SIFT, SURF, and Lucas-Kanade, as well as recent transformer-based trackers like PIPs++ and CoTracker. Overall, our unsupervised learning approach excels in tracking various skin features under significant motion conditions, providing superior feature descriptors for tracking, matching, and image registration compared to both traditional and state-of-the-art supervised learning methods.
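The Gaussian weighting of pixel residuals mentioned in the abstract can be illustrated with a short NumPy sketch. It is not the authors' implementation. The crop size of 31 and the standard deviation of five are taken from the TL;DR and the Figure 3 caption; the single-channel crop, the normalisation of the weights to sum to one, and the function names `gaussian_weights` and `weighted_reconstruction_loss` are assumptions made for illustration.

```python
import numpy as np


def gaussian_weights(size: int = 31, sigma: float = 5.0) -> np.ndarray:
    """2D Gaussian centred on the crop, scaled so the weights sum to 1.

    The normalisation is an assumption; the source only states that a
    Gaussian weight is applied to the pixel residuals.
    """
    coords = np.arange(size) - (size - 1) / 2.0
    g1d = np.exp(-(coords ** 2) / (2.0 * sigma ** 2))
    w = np.outer(g1d, g1d)  # separable 2D Gaussian
    return w / w.sum()


def weighted_reconstruction_loss(crop, recon, weights) -> float:
    """Sum of squared residuals, each pixel scaled by its Gaussian weight."""
    residual = np.asarray(crop, dtype=float) - np.asarray(recon, dtype=float)
    return float(np.sum(weights * residual ** 2))


# Hypothetical single-channel crop and two imperfect "reconstructions".
rng = np.random.default_rng(0)
crop = rng.random((31, 31))
weights = gaussian_weights()

edge_error = crop.copy()
edge_error[0, :] += 0.5      # same error magnitude, placed on the top edge
centre_error = crop.copy()
centre_error[15, :] += 0.5   # same error magnitude, placed on the centre row

loss_edge = weighted_reconstruction_loss(crop, edge_error, weights)
loss_centre = weighted_reconstruction_loss(crop, centre_error, weights)
```

With these weights, an identical residual costs far more on the centre row than on the border row, so reconstruction errors near the crop edges contribute little to the loss. This is the intended effect of down-weighting edges, under the assumptions stated above.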

Paper Structure (27 sections, 4 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Schematic workflow of our analyses and experiments. The UTKFace (training) dataset was used to train the autoencoder in our DFE method. Our validation dataset was manually labelled to obtain the ground-truth locations of the features. The feature tracking methods were used to predict the location of the skin features in the validation dataset. For SIFT, SURF, DFE, and wDFE, the optimal match is determined to obtain subpixel-level predictions before calculating the error relative to the manual labelling. Based on those errors, the sorted errors, cumulative tracking errors, and mean errors are reported. The SSR landscape takes the high-dimensional representations of the points as input and visualises them.
  • Figure 2: Flowchart of the algorithm for using an autoencoder for matching facial features.
  • Figure 3: Weights of the 2D Gaussian with a standard deviation of five. We plotted the reference crop of the face mole under static conditions for easier visualisation.
  • Figure 4: Sorted errors for matching the face mole under static conditions (top-left), nose tip under static conditions (top-right), face mole under bike conditions (bottom-left), and nose tip under bike conditions (bottom-right). The purple dashed line for SIFT with threshold marks the frames where the nearest neighbor distance threshold was larger than $0.8$. Since CoTracker predicts at the pixel level, errors of exactly 0 are not visible on the log-scale plot.
  • Figure 5: Cumulative sum of standardized squared errors using the reference feature from the original image for tracking the face mole under static conditions (top-left), nose tip under static conditions (top-right), face mole under bike conditions (bottom-left), and nose tip under bike conditions (bottom-right).
  • ...and 2 more figures
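
The error summaries named in the captions above (sorted errors, cumulative tracking errors, and mean errors) can be sketched as follows. This is a minimal illustration rather than the authors' evaluation code: it assumes the per-frame error is the Euclidean distance in pixels between the predicted and manually labelled positions, the coordinates are made up, and the helper names are hypothetical. The standardisation used for Figure 5 is not described here, so the sketch accumulates raw errors instead.

```python
import numpy as np


def tracking_errors(pred, truth) -> np.ndarray:
    """Per-frame Euclidean distance (pixels) between prediction and label.

    Assumes both inputs are arrays of shape (n_frames, 2) holding (x, y).
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return np.linalg.norm(pred - truth, axis=1)


def summarise(errors: np.ndarray) -> dict:
    """Sorted errors, running sum over frames, and mean error."""
    return {
        "sorted": np.sort(errors),
        "cumulative": np.cumsum(errors),
        "mean": float(np.mean(errors)),
    }


# Hypothetical labels and predictions for four frames.
truth = np.array([[100.0, 50.0], [101.0, 50.5], [102.0, 51.0], [103.0, 51.5]])
offsets = np.array([[0.3, 0.4], [0.0, 0.0], [3.0, 4.0], [0.6, 0.8]])
pred = truth + offsets

errors = tracking_errors(pred, truth)
summary = summarise(errors)
```

The second frame has an error of exactly zero. On a logarithmic axis such a value cannot be drawn, which is the effect the Figure 4 caption notes for pixel-level predictions.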