Computerized Assessment of Motor Imitation for Distinguishing Autism in Video (CAMI-2DNet)

Kaleab A. Kinfu; Carolina Pacheco; Alice D. Sperry; Deana Crocetti; Bahar Tunçgenç; Stewart H. Mostofsky; René Vidal

TL;DR

This paper addresses the heterogeneity of autism by automating motor imitation assessment from video. It introduces CAMI-2DNet, a disentangled representation learning framework that maps 2D pose sequences to motion, shape, and viewpoint encodings and is trained on both synthetic motion-retargeting data and real participant data. Imitation is scored by the cosine similarity between the actor's and the imitator's motion encodings after DTW alignment. The approach achieves diagnostically relevant performance comparable to CAMI-3D while being more practical: it operates directly on video, without ad hoc normalization or human annotations, and provides interpretable scores localized by body segment. These results position CAMI-2DNet as a scalable, objective, and interpretable tool for motor imitation assessment in autism, with potential for broader clinical deployment and personalized interventions.

Abstract

Motor imitation impairments are commonly reported in individuals with autism spectrum conditions (ASCs), suggesting that motor imitation could be used as a phenotype for addressing autism heterogeneity. Traditional methods for assessing motor imitation are subjective, labor-intensive, and require extensive human training. Modern Computerized Assessment of Motor Imitation (CAMI) methods, such as CAMI-3D for motion capture data and CAMI-2D for video data, are less subjective. However, they rely on labor-intensive data normalization and cleaning techniques, and human annotations for algorithm training. To address these challenges, we propose CAMI-2DNet, a scalable and interpretable deep learning-based approach to motor imitation assessment in video data, which eliminates the need for data normalization, cleaning and annotation. CAMI-2DNet uses an encoder-decoder architecture to map a video to a motion encoding that is disentangled from nuisance factors such as body shape and camera views. To learn a disentangled representation, we employ synthetic data generated by motion retargeting of virtual characters through the reshuffling of motion, body shape, and camera views, as well as real participant data. To automatically assess how well an individual imitates an actor, we compute a similarity score between their motion encodings, and use it to discriminate individuals with ASCs from neurotypical (NT) individuals. Our comparative analysis demonstrates that CAMI-2DNet has a strong correlation with human scores while outperforming CAMI-2D in discriminating ASC vs NT children. Moreover, CAMI-2DNet performs comparably to CAMI-3D while offering greater practicality by operating directly on video data and without the need for ad-hoc data normalization and human annotations.
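
The scoring step summarized in the TL;DR (cosine similarity between motion encodings after DTW alignment) can be illustrated with a minimal sketch. It assumes each video has already been encoded as an array with one motion-encoding vector per time step. The function names, the array layout, and the choice to average similarity along the alignment path are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine_distance_matrix(a, b, eps=1e-8):
    """Pairwise cosine distance between rows of a (Ta, D) and rows of b (Tb, D)."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return 1.0 - a_n @ b_n.T

def dtw_path(cost):
    """Classic dynamic time warping; returns the minimum-cost alignment path."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )
    # Backtrack from the end of both sequences to their start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    return path[::-1]

def imitation_score(actor_enc, imitator_enc):
    """Mean cosine similarity along the DTW path (hypothetical aggregation)."""
    cost = cosine_distance_matrix(actor_enc, imitator_enc)
    path = dtw_path(cost)
    return float(np.mean([1.0 - cost[i, j] for i, j in path]))

# Toy data: an imitator who repeats the actor's motion at half speed.
rng = np.random.default_rng(0)
actor = rng.normal(size=(20, 8))
slow_copy = np.repeat(actor, 2, axis=0)
unrelated = rng.normal(size=(30, 8))
print(imitation_score(actor, slow_copy), imitation_score(actor, unrelated))
```

In this toy setting the half-speed copy scores close to 1 because DTW absorbs the timing difference, whereas the unrelated sequence scores lower. Real motion encodings would come from the trained encoder rather than random vectors.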

Paper Structure (55 sections, 16 equations, 5 figures, 7 tables)

Introduction
Overview of CAMI-2DNet
Stages of CAMI-2DNet
Estimating 2D Body Pose
Learning Disentangled Representations
Computing Motion Similarity
Learning Disentangled Representations
Model Architecture
Encoding
Decoding
Training Data
Motion Retargeting
Integrating Synthetic and Real Participant Data
Training Objectives
Disentanglement Losses
...and 40 more sections

Figures (5)

Figure 1: Overview of CAMI-2DNet. (a) Given videos of an actor performing a target action and an individual imitating it, CAMI-2DNet extracts 2D joint positions using a pose estimation network and encodes the joint trajectories into disentangled motion, shape, and viewpoint components. The imitation score is computed by calculating the cosine similarity of the motion representations ($M_a$ for the actor and $M_i$ for the individual). (b) During training, the model learns these disentangled representations from synthetic data generated via motion retargeting (varying motion, shape, and viewpoint) and real participant data from neurotypical individuals and individuals with ASCs. The encoder-decoder architecture is optimized using reconstruction and disentanglement losses, ensuring effective encoding and disentanglement of motion, shape, and viewpoint.
Figure 2: Comparing CAMI-2DNet, CAMI-2D, CAMI-3D, and human observation coding (HOC) on the CAMI-47 dataset (27 ASCs, 20 NT). (a) Correlation with HOC Scores: Scatter plots showing the correlation between HOC scores and the scores from CAMI-3D, CAMI-2D, and CAMI-2DNet. CAMI-2DNet has the highest correlation with HOC scores. (b) ROC Curve for Both Sequences: Receiver operating characteristic (ROC) curve, plotting true positive rate against false positive rate as the classification threshold is varied. The Area Under the Curve (AUC) indicates the diagnostic ability of the different methods. CAMI-2DNet (AUC = 0.843) demonstrates comparable performance to CAMI-3D (AUC = 0.859) and superior performance over both HOC (AUC = 0.792) and CAMI-2D (AUC = 0.789). (c) Violin Plot of Scores: The violin plots illustrate the distribution of scores for the ASC and NT groups across the four methods. CAMI-2DNet not only shows a clear separation between the ASC and NT groups but also displays less variability within each group, highlighting its robustness and reliability.
Figure 3: Receiver Operating Characteristic (ROC) curves comparing the diagnostic performance of HOC, CAMI-3D, CAMI-2D, and CAMI-2DNet across two datasets: CAMI-47 and CAMI-185. The top row (a-d) presents results on the CAMI-47 dataset for two sequences, each consisting of two trials. CAMI-2DNet consistently outperforms HOC and CAMI-2D and demonstrates comparable or superior performance to CAMI-3D. The bottom row (e-h) shows results on the CAMI-185 dataset, comparing CAMI-2DNet with CAMI-2D across two sequences and two trials. CAMI-2DNet achieves higher diagnostic accuracy in all trials, demonstrating a higher AUC than CAMI-2D.
Figure 4: Visualization of localized motion imitation scores for body segments, comparing the actor (left) and children's imitation (right). (a) Top: The scores indicate low similarity for arms (left: 0.35, right: 0.32), with high alignment for the torso (0.92), left leg (0.99), and right leg (0.98). (b) Bottom: Higher similarity for the left arm (0.95) but lower for the right arm (0.46). Torso (0.94), left leg (0.99), and right leg (0.99) maintain high alignment. Red highlights low alignment below a threshold, while green indicates high alignment.
Figure 5: Reconstruction error comparison across various movement types for Sequence 1 (a) and Sequence 2 (b). Lower values indicate better reconstruction quality. Errors are computed in a pixel space normalized to the [0,1] range. CAMI-2DNet consistently outperforms BPE (Park et al., 2021), demonstrating greater accuracy and robustness in reconstructing diverse motion patterns in CAMI-185.
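
The AUC values reported in Figures 2(b) and 3 summarize how well a scalar imitation score separates the two groups. As a minimal sketch, AUC can be computed directly from two lists of scores using its standard equivalence to the Mann-Whitney statistic: the probability that a randomly chosen score from one group exceeds a randomly chosen score from the other. The function name and the toy scores below are hypothetical, and the sketch assumes that higher imitation scores are more typical of the NT group.

```python
import numpy as np

def roc_auc(scores_nt, scores_asc):
    """AUC as the probability that a random NT score exceeds a random ASC score.

    Ties count as one half. Assumes higher imitation scores indicate NT.
    """
    nt = np.asarray(scores_nt, dtype=float)[:, None]
    asc = np.asarray(scores_asc, dtype=float)[None, :]
    wins = (nt > asc).sum() + 0.5 * (nt == asc).sum()
    return float(wins / (nt.size * asc.size))

# Toy scores, not taken from the paper.
print(roc_auc([0.9, 0.8, 0.7], [0.6, 0.5, 0.4]))  # perfectly separated groups
print(roc_auc([0.9, 0.5], [0.7, 0.5]))            # overlapping groups with a tie
```

An AUC of 0.5 corresponds to chance-level separation and 1.0 to perfect separation, which is the scale on which the methods in the figures are compared.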
