SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living

Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang, Srijan Das

TL;DR

This work tackles the challenge of understanding Activities of Daily Living (ADL) with vision-language models by bridging the gap between skeleton-based motion cues and language supervision. It introduces SkeletonCLIP as a skeleton-language backbone and builds SKI-VLM and SKI-LVLM by integrating SkeletonCLIP with VLMs and LVLMs through online distillation and projection-based fusion, respectively. The approach yields state-of-the-art zero-shot action recognition on NTU60/NTU120 and improves dense captioning on Charades, while requiring no skeleton data at inference. The results demonstrate that incorporating language-grounded skeleton information enhances ADL understanding and offers practical benefits for robust, skeleton-informed video understanding in real-world scenarios.

Abstract

The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. However, these approaches are not integrated with language, limiting their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.

Paper Structure

This paper contains 19 sections, 4 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Left: The illustration depicts the embedding space of a Vision-Language Model (VLM), where representations of web-based videos align closely with their corresponding class label text features, while those of Activities of Daily Living (ADL) videos remain distant. Our study reveals that integrating skeleton guidance bridges this gap, aligning ADL video representations with their respective class labels. Right: Activation maps demonstrate how skeleton guidance sharpens the model’s focus on the critical body parts (such as legs) for specific actions, like Walk. This enhancement is evident in the improved text descriptions generated by Large Vision-Language Models (LVLMs) when queried about actions depicted in the videos.
  • Figure 2: (a) SkeletonCLIP: Utilizes a pretrained Skeleton Backbone, aligned with action class labels from the frozen CLIP Text Encoder during Skeleton-Language Supervision. (b) SKI-VLM: Engages in online distillation between SkeletonCLIP and a Vision-Language Model (VLM), both trainable. For inference, the VLM alone performs zero-shot action recognition on unseen classes. (c) SKI-LVLM: Projects SkeletonCLIP features into LLM space along with video features. Only the projection layers are trainable, while SkeletonCLIP, CLIP encoder, and LLM are frozen. Inference uses the CLIP vision encoder to extract video features, which, together with the user query, are input to the LLM to generate a response based on the video content.
  • Figure 3: Attention Map Visualization: Comparison between ViFiCLIP and SKI-VLM. While ViFiCLIP struggles to identify the critical areas responsible for actions, SKI-VLM accurately focuses on the relevant joints, such as hands and face, for actions like Sneeze/Cough.
  • Figure 4: Illustration of how SkeletonCLIP can be easily integrated with VLMs like ViFiCLIP, FROSTER, and XCLIP (left to right).
  • Figure 5: Impact of $\alpha$ in SKI-VLM for NTU48 and NTU110
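
The Figure 2(b) caption notes that, at inference, the VLM alone performs zero-shot action recognition on unseen classes. The sketch below illustrates the generic CLIP-style procedure this usually refers to: compare a video embedding with one text embedding per class label and pick the closest one. It is not the authors' implementation. The function names, the toy 3-dimensional embeddings, and the temperature value of 0.07 are all assumptions made for illustration; a real system would obtain the embeddings from the trained video and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(video_embedding, class_text_embeddings, temperature=0.07):
    """Pick the class whose text embedding is most similar to the video embedding.

    `class_text_embeddings` maps a class label to its text embedding.
    The temperature is an assumed value, not one taken from the paper.
    """
    v = l2_normalize(video_embedding)

    # Cosine similarity between the video and each class label, scaled by temperature.
    logits = {}
    for label, text_vec in class_text_embeddings.items():
        t = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(v, t))
        logits[label] = cosine / temperature

    # Numerically stable softmax over the class logits.
    max_logit = max(logits.values())
    exps = {label: math.exp(z - max_logit) for label, z in logits.items()}
    total = sum(exps.values())
    probs = {label: e / total for label, e in exps.items()}

    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical toy embeddings standing in for encoder outputs.
class_text_embeddings = {
    "walk": [0.9, 0.1, 0.0],
    "sneeze/cough": [0.1, 0.9, 0.2],
    "drink water": [0.0, 0.2, 0.9],
}
video_embedding = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(video_embedding, class_text_embeddings)
print(label)
```

Because only the video and text embeddings enter this comparison, the sketch is consistent with the abstract's point that no skeleton data is needed at inference: in the described design, skeleton information influences the VLM during training only.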