VLPose: Bridging the Domain Gap in Pose Estimation with Language-Vision Tuning

Jingyao Li, Pengguang Chen, Xuan Ju, Hong Xu, Jiaya Jia

TL;DR

The paper tackles the domain gap in human pose estimation between natural and artificial scenes, which hampers generalization and applications in VR/AR. It introduces VLPose, a framework that leverages language-vision tuning through a text encoder with domain prompts, a vision-language relation matcher, and a dual extractor-injector decoder to fuse image-text relations into pose estimation. Extensive ablations and experiments show substantial improvements on HumanArt and MSCOCO, with robustness across backbones and the ability to revert to original weights when desired. This work advances cross-domain HPE by integrating language models to adapt pose estimators to diverse artistic domains, broadening practical applicability.

Abstract

Thanks to advances in deep learning techniques, Human Pose Estimation (HPE) has achieved significant progress in natural scenarios. However, these models perform poorly in artificial scenarios such as painting and sculpture due to the domain gap, constraining the development of virtual reality and augmented reality. With the growth of model size, retraining the whole model on both natural and artificial data is computationally expensive and inefficient. Our research aims to bridge the domain gap between natural and artificial scenarios with efficient tuning strategies. Leveraging the potential of language models, we enhance the adaptability of traditional pose estimation models across diverse scenarios with a novel framework called VLPose. VLPose leverages the synergy between language and vision to extend the generalization and robustness of pose estimation models beyond the traditional domains. Our approach has demonstrated improvements of 2.26% and 3.74% on HumanArt and MSCOCO, respectively, compared to state-of-the-art tuning strategies.

Paper Structure (22 sections, 8 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Comparison of our proposed VLPose with pretrained ViTPose [vitpose] and fine-tuned ViTPose. "ft" denotes the ViTPose model fine-tuned on HumanArt, and "sc" denotes the scratch ViTPose model pre-trained on MSCOCO. The size of each circle indicates the size of the corresponding model.
  • Figure 2: Overview of our framework. A text encoder encodes domain-specific information. A vision-language relation matcher then captures the relationship between images and text and feeds it into the dual extractor-injector decoder for pose estimation. Domain-specific information is integrated into the pose estimator to minimize performance discrepancies across domains.
  • Figure 3: A spectrum of decoder architectures: (a) The baseline decoder uses two deconvolution blocks, upsampling operations, and a $1\times1$ convolution layer to generate keypoint heatmaps. (b) The injector decoder combines the image feature $E$ and the image-text relation $R$ and feeds them into the baseline structure. (c) The extractor-injector decoder has two branches, a main branch and an auxiliary branch. The auxiliary branch extracts features and injects relation knowledge to improve pose estimation. (d) The dual extractor-injector decoder strengthens the interaction between the main and auxiliary branches.
  • Figure 4: A visual comparison between our method and the current SOTA, ViTPose [vitpose], on HumanArt [humanart] categories involving real human bodies with large motion amplitudes in I(a)-II(a), 3D artistic human bodies in II(c)-II(g), and 2D artistic human bodies in III-V.
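
The baseline and injector decoders described in Figure 3(a) and 3(b) can be illustrated with a small shape-tracing sketch. It is not the authors' implementation. The captions do not give layer hyperparameters or say how $E$ and $R$ are combined, so the sketch assumes that each deconvolution block uses kernel 4, stride 2, padding 1 (which doubles the resolution), that injection is a channel-wise concatenation of $R$ onto $E$, and that there are 17 keypoints as in MSCOCO. The names `FeatureShape`, `deconv_out`, `baseline_decoder_shape`, and `injector_decoder_shape` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FeatureShape:
    """Shape of a feature map as (channels, height, width)."""
    channels: int
    height: int
    width: int


def deconv_out(size: int, kernel: int = 4, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after one transposed convolution (dilation 1, no output padding).

    The default hyperparameters are an assumption, not taken from the paper.
    """
    return (size - 1) * stride - 2 * padding + kernel


def baseline_decoder_shape(feat: FeatureShape, num_keypoints: int = 17) -> FeatureShape:
    """Trace shapes through two deconvolution blocks and a final 1x1 convolution."""
    h, w = feat.height, feat.width
    for _ in range(2):  # two deconvolution blocks, each doubling the resolution
        h, w = deconv_out(h), deconv_out(w)
    # The 1x1 convolution changes only the channel count: one heatmap per keypoint.
    return FeatureShape(num_keypoints, h, w)


def injector_decoder_shape(e: FeatureShape, r_channels: int,
                           num_keypoints: int = 17) -> FeatureShape:
    """Assumed injection: concatenate R onto E along channels, then run the baseline."""
    fused = FeatureShape(e.channels + r_channels, e.height, e.width)
    return baseline_decoder_shape(fused, num_keypoints=num_keypoints)


if __name__ == "__main__":
    # Assumed input: a 768-channel, 16x12 token grid (e.g. a 256x192 crop with 16x16 patches).
    E = FeatureShape(channels=768, height=16, width=12)
    print(baseline_decoder_shape(E))                # FeatureShape(channels=17, height=64, width=48)
    print(injector_decoder_shape(E, r_channels=4))  # same heatmap shape; only the input width differs
```

Under these assumptions, injecting $R$ changes only the number of input channels seen by the first deconvolution block, so the heatmap resolution and keypoint count match the baseline decoder.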