Extending 3D body pose estimation for robotic-assistive therapies of autistic children
Laura Santos, Bernardo Carvalho, Catarina Barata, José Santos-Victor
TL;DR
This work addresses accurate, non-intrusive 3D pose estimation for autistic children in robotic-assisted therapy by adapting a state-of-the-art CRMH-based 3D reconstruction pipeline to children. It personalizes the focal length input via a height-based regression, learned with RANSAC, to correct child-specific depth translation and enable both offline pose recovery and online interaction. In controlled experiments, the proposed CRMH-p achieves 3D errors below $0.3$ m, outperforming a BEV baseline, and in real therapy sessions it recovers skeletons missed by Kinect while maintaining competitive pose orientation accuracy. The approach offers a practical route to reliable pose estimation in occluded, unconstrained therapy settings, with future work aiming to optimize acquisition geometry and realize online deployment.
Abstract
Robotic-assistive therapy has demonstrated very encouraging results for children with autism. Accurate estimation of the child's pose is essential both for human-robot interaction and for therapy assessment. Non-intrusive methods are the only viable option, since these children are sensitive to touch. While depth cameras have been used extensively, existing methods face two major limitations: (i) they are usually trained on adult-only data and do not correctly estimate a child's pose, and (ii) they fail in scenarios with many occlusions. Our goal was therefore to develop a 3D pose estimator for children by adapting an existing state-of-the-art 3D body modelling method and incorporating a linear regression model to fine-tune one of its inputs, thereby correcting the pose of children's 3D meshes. In controlled settings, our method achieves an error below $0.3$ m, which is considered acceptable for this kind of application and is lower than that of current state-of-the-art methods. In real-world settings, the proposed model performs similarly to a Kinect depth camera and successfully estimates 3D body poses in a much larger number of frames.
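To illustrate the idea of a robust linear regression that maps a child's height to a corrected focal length input, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`fit_line`, `ransac_line`, `predict_focal_length`), the inlier threshold, the units, and all data values are hypothetical and chosen only to make the example run.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit y = a*x + b for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def ransac_line(points, n_iters=200, threshold=20.0, seed=0):
    """RANSAC line fit: repeatedly fit a line through two random points,
    keep the largest inlier set, then refit on those inliers."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # vertical candidate line, skip it
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        inliers = [(x, y) for x, y in points if abs(y - (a * x + b)) <= threshold]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return fit_line(best_inliers), best_inliers

# Hypothetical training data: (height in cm, focal length in pixels).
# The linear trend 4*h + 300 is invented for this example only.
samples = [(h, 4.0 * h + 300.0) for h in range(100, 151, 5)]
samples += [(110, 1500.0), (140, 200.0)]  # two gross outliers

(slope, intercept), inliers = ransac_line(samples)

def predict_focal_length(height_cm):
    """Personalized focal length for a given height (hypothetical model)."""
    return slope * height_cm + intercept
```

In this sketch, the two outliers are excluded from the final fit, so the recovered line follows the trend of the remaining samples. A predicted value such as `predict_focal_length(120)` would then replace the default focal length passed to the reconstruction method.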
