SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living
Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang, Srijan Das
TL;DR
This work tackles the challenge of understanding Activities of Daily Living (ADL) with vision-language models by bridging the gap between skeleton-based motion cues and language supervision. It introduces SkeletonCLIP as a skeleton-language backbone and builds SKI-VLM and SKI-LVLM by integrating SkeletonCLIP with VLMs and LVLMs through online distillation and projection-based fusion, respectively. The approach yields state-of-the-art zero-shot action recognition on NTU60/NTU120 and improves dense captioning on Charades, while requiring no skeleton data at inference. The results demonstrate that incorporating language-grounded skeleton information enhances ADL understanding and offers practical benefits for robust, skeleton-informed video understanding in real-world scenarios.
Abstract
The introduction of vision-language models like CLIP has enabled the development of foundational video models capable of generalizing to unseen videos and human actions. However, these models are typically trained on web videos, which often fail to capture the challenges present in Activities of Daily Living (ADL) videos. Existing works address ADL-specific challenges, such as similar appearances, subtle motion patterns, and multiple viewpoints, by combining 3D skeletons and RGB videos. Yet these approaches are not integrated with language, which limits their ability to generalize to unseen action classes. In this paper, we introduce SKI models, which integrate 3D skeletons into the vision-language embedding space. SKI models leverage a skeleton-language model, SkeletonCLIP, to infuse skeleton information into Vision Language Models (VLMs) and Large Vision Language Models (LVLMs) through collaborative training. Notably, SKI models do not require skeleton data during inference, enhancing their robustness for real-world applications. The effectiveness of SKI models is validated on three popular ADL datasets for zero-shot action recognition and video caption generation tasks.
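To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes CLIP-style zero-shot recognition (pick the class whose text embedding is closest to the video embedding by cosine similarity) and uses a mean squared error between normalized embeddings as a stand-in for the distillation objective, since the abstract does not specify the actual loss. The function names, toy embeddings, and class labels are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def zero_shot_predict(video_emb, class_text_embs):
    """Return (best_class, scores) by comparing a video embedding
    against one text embedding per candidate class name."""
    scores = {name: cosine(video_emb, emb) for name, emb in class_text_embs.items()}
    return max(scores, key=scores.get), scores


def distillation_loss(student_emb, teacher_emb):
    """Assumed stand-in loss: MSE between the normalized student (video)
    embedding and the normalized teacher (skeleton) embedding."""
    s, t = l2_normalize(student_emb), l2_normalize(teacher_emb)
    return sum((x - y) ** 2 for x, y in zip(s, t)) / len(s)


# Toy 2-D embeddings, purely hypothetical.
text_embs = {"drink water": [1.0, 0.0], "sit down": [0.0, 1.0]}
label, _ = zero_shot_predict([0.9, 0.2], text_embs)
print(label)
```

In this sketch the skeleton (teacher) embedding appears only in `distillation_loss`, which would be used during training; `zero_shot_predict` takes only the video embedding, mirroring the abstract's point that skeleton data is not needed at inference.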
