Learning Transformation-Isomorphic Latent Space for Accurate Hand Pose Estimation
Kaiwen Ren, Lei Hu, Zhiheng Zhang, Yongjing Ye, Shihong Xia
TL;DR
This work tackles the challenge of hand pose estimation by rethinking representation learning through transformation isomorphism: aligning transformations across image, latent, and pose spaces. TI-Net learns a transformation-consistent latent space using lightweight latent transformers and rotation-embedding to ensure the latent features reflect pose-related, low-level information. Through pretraining that combines reconstruction with ordinary and secondary transformation constraints, TI-Net achieves state-of-the-art PA-MPJPE on DexYCB and strong results on InterHand2.6M, while also offering faster convergence and easy integration with existing pose estimation frameworks. The approach demonstrates that transformation-consistent, compact latent representations can significantly improve regression performance in vision tasks and invites extension to other regression problems beyond hand pose estimation.
Abstract
Vision-based regression tasks, such as hand pose estimation, have achieved higher accuracy and faster convergence through representation learning. However, existing representation learning methods often encounter the following issues: the high semantic level of features extracted from images is inadequate for regressing low-level information, and the extracted features include task-irrelevant information, reducing their compactness and interfering with regression tasks. To address these challenges, we propose TI-Net, a highly versatile visual Network backbone designed to construct a Transformation Isomorphic latent space. Specifically, we employ linear transformations to model geometric transformations in the latent space and ensure that TI-Net aligns them with those in the image space. This ensures that the latent features capture compact, low-level information beneficial for pose estimation tasks. We evaluated TI-Net on the hand pose estimation task to demonstrate the network's superiority. On the DexYCB dataset, TI-Net achieved a 10% improvement in the PA-MPJPE metric compared to specialized state-of-the-art (SOTA) hand pose estimation methods. Our code will be released in the future.
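
The alignment between image-space and latent-space transformations described above can be illustrated with a toy example. The sketch below is not TI-Net: the encoder is a fixed random orthogonal matrix rather than a learned network, the only geometric transformation is a 90-degree image rotation, and the names `rot90_permutation` and `consistency_loss` are hypothetical. It only shows the kind of constraint involved: encoding a transformed image should match applying a linear map to the encoding of the original image.

```python
import numpy as np


def rot90_permutation(h, w):
    """Permutation matrix P such that P @ x.ravel() == np.rot90(x).ravel()."""
    # src[i] is the flat index in the original image of the pixel that lands
    # at flat position i after the rotation.
    src = np.rot90(np.arange(h * w).reshape(h, w)).ravel()
    p = np.zeros((h * w, h * w))
    p[np.arange(h * w), src] = 1.0
    return p


def consistency_loss(encode, image, image_transform, latent_map):
    """Mean squared gap between encode(T(x)) and latent_map @ encode(x)."""
    z = encode(image)
    z_transformed = encode(image_transform(image))
    return float(np.mean((z_transformed - latent_map @ z) ** 2))


rng = np.random.default_rng(0)
h = w = 4

# Toy encoder (assumption): a fixed orthogonal matrix acting on flattened pixels.
q, _ = np.linalg.qr(rng.normal(size=(h * w, h * w)))


def encode(img):
    return q @ img.ravel()


# Latent-space linear map that mirrors the image-space rotation for this encoder.
latent_rot = q @ rot90_permutation(h, w) @ q.T

image = rng.normal(size=(h, w))
loss_aligned = consistency_loss(encode, image, np.rot90, latent_rot)
loss_identity = consistency_loss(encode, image, np.rot90, np.eye(h * w))

print(f"aligned latent map:  {loss_aligned:.2e}")
print(f"identity latent map: {loss_identity:.2e}")
```

In this toy setting the aligned latent map drives the loss to numerical zero, while an unrelated map (the identity) does not. In a learned setting, a loss of this form would presumably be minimized over the encoder and the latent maps jointly, alongside the other pretraining terms.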
