Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning

Xinshun Wang, Zhongbin Fang, Xia Li, Xiangtai Li, Mengyuan Liu

TL;DR

Skeleton-in-Context (SiC) addresses the absence of generalist, in-context modeling for 3D skeleton sequences by introducing a task-prompted paradigm that learns multiple skeleton-based tasks without task-specific heads or fine-tuning. It combines a Task-Guided Prompt (TGP) and a Task-Unified Prompt (TUP) with a skeleton bank and a two-stream transformer, so that the model perceives the task from context and performs the appropriate operation on a query, which also enables generalization to unseen tasks. The approach achieves state-of-the-art multi-task performance on motion prediction (MP), pose estimation (PE), joint completion (JC), and future pose estimation (FPE), and generalizes across datasets and to unseen tasks such as motion-in-between. This work is a first step toward end-to-end in-context learning for dynamic skeleton sequences, with practical implications for unified human motion understanding across datasets and tasks.

Abstract

In-context learning provides a new perspective for multi-task modeling in vision and NLP. Under this setting, the model can perceive tasks from prompts and accomplish them without any extra task-specific prediction heads or model fine-tuning. However, skeleton sequence modeling via in-context learning remains unexplored. Directly applying existing in-context models from other areas to skeleton sequences fails because inter-frame and cross-task pose similarity makes it exceptionally hard to perceive the task correctly from a subtle context. To address this challenge, we propose Skeleton-in-Context (SiC), an effective framework for in-context skeleton sequence modeling. SiC handles multiple skeleton-based tasks simultaneously after a single training process and accomplishes each task from context according to the given prompt. It can further generalize to new, unseen tasks according to customized prompts. To facilitate context perception, we additionally propose a task-unified prompt, which adaptively learns tasks of different natures, such as partial joint-level generation, sequence-level prediction, or 2D-to-3D motion prediction. We conduct extensive experiments to evaluate the effectiveness of SiC on multiple tasks, including motion prediction, pose estimation, joint completion, and future pose estimation. We also evaluate its generalization capability on unseen tasks such as motion-in-between. These experiments show that our model achieves state-of-the-art multi-task performance and even outperforms single-task methods on certain tasks.
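To make the prompt-based setup in the abstract concrete, the following is a minimal sketch of how one in-context sample (a prompt pair plus a query) might be assembled for the four named tasks. The task definitions used here are assumptions for illustration, not the authors' data pipeline: motion prediction maps past frames to future frames, pose estimation maps 2D to 3D, joint completion restores masked joints, and future pose estimation maps 2D past frames to 3D future frames. The helper names (`to_2d`, `mask_joints`, `make_pair`, `build_sample`) are hypothetical, and the 2D input is crudely approximated by zeroing depth rather than by a camera projection.

```python
from typing import Dict, List, Tuple

Joint = Tuple[float, float, float]   # (x, y, z)
Frame = List[Joint]                  # all joints of one pose
Motion = List[Frame]                 # a skeleton sequence over time


def to_2d(seq: Motion) -> Motion:
    """Crude 2D stand-in: zero the depth (assumption, not a real projection)."""
    return [[(x, y, 0.0) for (x, y, _z) in frame] for frame in seq]


def mask_joints(seq: Motion, missing: Tuple[int, ...]) -> Motion:
    """Zero out the joints whose indices are listed in `missing`."""
    hidden = set(missing)
    return [
        [(0.0, 0.0, 0.0) if j in hidden else joint
         for j, joint in enumerate(frame)]
        for frame in seq
    ]


def make_pair(task: str, seq: Motion,
              missing: Tuple[int, ...] = (0,)) -> Tuple[Motion, Motion]:
    """Return an (input, target) pair for one task (assumed definitions)."""
    half = len(seq) // 2
    past, future = seq[:half], seq[half:2 * half]
    if task == "MP":    # motion prediction: past 3D -> future 3D
        return past, future
    if task == "PE":    # pose estimation: 2D -> 3D on the same frames
        return to_2d(past), past
    if task == "JC":    # joint completion: masked joints -> full pose
        return mask_joints(past, missing), past
    if task == "FPE":   # future pose estimation: past 2D -> future 3D
        return to_2d(past), future
    raise ValueError(f"unknown task: {task}")


def build_sample(task: str, prompt_seq: Motion,
                 query_seq: Motion) -> Dict[str, Motion]:
    """Pair a prompt example with a query that performs the same task."""
    prompt_in, prompt_tgt = make_pair(task, prompt_seq)
    query_in, query_tgt = make_pair(task, query_seq)
    return {
        "prompt_input": prompt_in,
        "prompt_target": prompt_tgt,
        "query_input": query_in,
        "query_target": query_tgt,  # supervision only; hidden at inference
    }
```

The point of the sketch is that the task is never passed to the model as a label: it is only implied by how the prompt input relates to the prompt target, which is what the model has to perceive from context.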

Paper Structure (25 sections, 8 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: In-context learning in a) image modeling [wang2023painter], b) point cloud modeling [fang2023pic], and c) skeleton sequence modeling (ours).
  • Figure 2: Top: Training with MIM-style framework [he2022mae, pang2022pointmae] used in previous works [wang2023painter, bar2022visualprompt, fang2023pic]. The model is able to reconstruct the masked frames well. Bottom: During inference, the reconstructed sequence gradually shrinks the skeleton as time goes by because the model only learns frame interpolation during training. Once the model loses subsequent reference frames when interpolating frames, the generated sequence will tend to shrink.
  • Figure 3: Overall framework of our Skeleton-in-Context. We build a skeleton bank by integrating the training sets of different tasks, which contain a large number of input-target pairs. We then randomly select one sample pair as the task-guided prompt (TGP) and one query input from the skeleton bank; each is encoded and concatenated, and the two are fed into the transformer in parallel. During this process, the query input is combined with the task-unified prompt (TUP) to form a new query. After $n_{1}$ iterations, the TGP and query are aggregated through $\mathcal{A}(\cdot)$ and passed through the transformer for another $n_{2}$ iterations. Lastly, the second half of the model output is used as the prediction.
  • Figure 4: Visual comparison between our Skeleton-in-Context and the recent SoTA method MotionBERT [zhu2023motionbert] on four tasks. Our method generates more accurate poses than MotionBERT, as shown in the regions marked by red arrows ($\searrow$).
  • Figure 5: Comparison of generalization. Our SiC completes the missing skeletons according to the customized prompt for an unseen task.
  • ...and 5 more figures
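
The data flow described in the Figure 3 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: `block` is a hypothetical placeholder for one transformer layer, the aggregation $\mathcal{A}(\cdot)$ is assumed to be an element-wise mean because the caption does not specify it, and the two streams are assumed to share the same layers. The names `aggregate` and `sic_forward` are illustrative only.

```python
from typing import Callable, List

Frame = List[float]     # one flattened pose (hypothetical layout)
Tokens = List[Frame]    # a sequence of frames


def aggregate(prompt: Tokens, query: Tokens) -> Tokens:
    """Stand-in for A(.): element-wise mean of both streams (assumption)."""
    return [
        [(p + q) / 2.0 for p, q in zip(p_frame, q_frame)]
        for p_frame, q_frame in zip(prompt, query)
    ]


def sic_forward(tgp_input: Tokens, tgp_target: Tokens,
                query_input: Tokens, tup: Tokens,
                block: Callable[[Tokens], Tokens],
                n1: int, n2: int) -> Tokens:
    """Two-stream pass: n1 parallel steps, aggregation, then n2 fused steps."""
    # Prompt stream: the TGP input followed by the TGP target.
    prompt_stream = tgp_input + tgp_target
    # Query stream: the query input followed by the TUP, which fills the
    # slot where the (unknown) query target would be.
    query_stream = query_input + tup
    if len(prompt_stream) != len(query_stream):
        raise ValueError("both streams must have the same length")

    # Stage 1: both streams go through the same block n1 times.
    for _ in range(n1):
        prompt_stream = block(prompt_stream)
        query_stream = block(query_stream)

    # Stage 2: fuse the streams, then run n2 more iterations.
    fused = aggregate(prompt_stream, query_stream)
    for _ in range(n2):
        fused = block(fused)

    # The second half of the output is taken as the prediction.
    return fused[len(fused) // 2:]
```

Any callable that maps a token sequence to one of equal length can serve as `block` for experimentation, for example an identity function when checking shapes.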