Computerized Assessment of Motor Imitation for Distinguishing Autism in Video (CAMI-2DNet)
Kaleab A. Kinfu, Carolina Pacheco, Alice D. Sperry, Deana Crocetti, Bahar Tunçgenç, Stewart H. Mostofsky, René Vidal
TL;DR
This paper tackles the heterogeneity of autism by automating motor imitation assessment from video. It introduces CAMI-2DNet, a disentangled representation learning framework that maps 2D pose sequences to motion, shape, and viewpoint encodings. The model is trained on synthetic data generated by motion retargeting together with real participant data, and it scores imitation with a cosine similarity between the actor's and the imitator's encodings after DTW alignment. The approach achieves diagnostically relevant performance comparable to CAMI-3D while offering greater practicality: it operates directly on video data without ad hoc normalization or human annotations, and it provides interpretable, localized scores by body segment. These results position CAMI-2DNet as a scalable, objective, and interpretable tool for motor imitation assessment in autism, with potential for broader clinical deployment and personalized interventions.
Abstract
Motor imitation impairments are commonly reported in individuals with autism spectrum conditions (ASCs), suggesting that motor imitation could be used as a phenotype for addressing autism heterogeneity. Traditional methods for assessing motor imitation are subjective, labor-intensive, and require extensive human training. Modern Computerized Assessment of Motor Imitation (CAMI) methods, such as CAMI-3D for motion capture data and CAMI-2D for video data, are less subjective. However, they rely on labor-intensive data normalization and cleaning techniques, and on human annotations for algorithm training. To address these challenges, we propose CAMI-2DNet, a scalable and interpretable deep learning-based approach to motor imitation assessment in video data, which eliminates the need for data normalization, cleaning, and annotation. CAMI-2DNet uses an encoder-decoder architecture to map a video to a motion encoding that is disentangled from nuisance factors such as body shape and camera views. To learn a disentangled representation, we employ synthetic data generated by motion retargeting of virtual characters through the reshuffling of motion, body shape, and camera views, as well as real participant data. To automatically assess how well an individual imitates an actor, we compute a similarity score between their motion encodings, and use it to discriminate individuals with ASCs from neurotypical (NT) individuals. Our comparative analysis demonstrates that CAMI-2DNet has a strong correlation with human scores while outperforming CAMI-2D in discriminating ASC vs. NT children. Moreover, CAMI-2DNet performs comparably to CAMI-3D while offering greater practicality by operating directly on video data and without the need for ad hoc data normalization and human annotations.
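As an illustration of the scoring step, the following is a minimal Python sketch of a similarity score between two sequences of motion encodings: the sequences are aligned with dynamic time warping (DTW) and the cosine similarity is then averaged along the alignment path. It is not the authors' implementation. The function names (`cosine_sim_matrix`, `dtw_path`, `imitation_score`), the per-frame encoding arrays, the use of cosine distance as the DTW cost, and the averaging over the path are all assumptions made for illustration; the trained encoder that would produce the encodings is not reproduced here.

```python
import numpy as np


def cosine_sim_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between rows of a (n, d) and b (m, d)."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a_n @ b_n.T


def dtw_path(cost):
    """Classic DTW: return the minimum-cost monotone path through cost (n, m)."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    # Backtrack from the last cell to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    return path[::-1]


def imitation_score(actor_enc, imitator_enc):
    """Mean cosine similarity along the DTW path (hypothetical aggregation)."""
    sim = cosine_sim_matrix(actor_enc, imitator_enc)
    path = dtw_path(1.0 - sim)  # cosine distance as the alignment cost
    return float(np.mean([sim[i, j] for i, j in path]))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    actor = rng.normal(size=(20, 16))        # stand-in for actor encodings
    slow_copy = np.repeat(actor, 2, axis=0)  # same motion at half speed
    unrelated = rng.normal(size=(25, 16))    # stand-in for a poor imitation
    print("time-stretched copy:", imitation_score(actor, slow_copy))
    print("unrelated motion:   ", imitation_score(actor, unrelated))
```

In this sketch, a time-stretched copy of the actor's sequence scores close to 1 because DTW absorbs the speed difference, while an unrelated sequence scores lower. Restricting the encodings to one body segment before scoring would be one way to obtain a localized score, though how CAMI-2DNet does this is not specified in the text above.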
