LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

Dominick Reilly, Rajatsubhra Chakraborty, Arkaprava Sinha, Manish Kumar Govind, Pu Wang, Francois Bremond, Le Xue, Srijan Das

TL;DR

LLAVIDAL addresses the limitations of web-video trained LLVMs for Activities of Daily Living by introducing ADL-X, a multiview RGBS instruction-tuning dataset, and a multimodal LLVM that fuses videos, 3D skeletons, and HOIs using a Multimodal Progressive (MMPro) training curriculum. The authors also propose the ADL MCQ and video description benchmarks and demonstrate state-of-the-art performance on ADL tasks when trained on ADL-X. This work advances fine-grained, view-invariant ADL understanding in vision-language models and provides a scalable framework for incorporating domain-specific multimodal cues. Public release of code, data, and features will facilitate further research and practical deployment in ADL-aware AI systems.

Abstract

Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. When training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available at: https://adl-x.github.io/.

Paper Structure (27 sections, 1 equation, 15 figures, 5 tables)

Figures (15)

  • Figure 1: Web-video trained Large Language Vision Models (LLVM) struggle to understand the fine-grained details and human-object interactions present in Activities of Daily Living (ADL). We propose LLAVIDAL, an LLVM trained with three modalities on our curated ADL-X dataset. ADL-X is derived from trimmed, multi-view ADL videos and is augmented with skeleton and object modalities.
  • Figure 2: ADL-X dataset curation pipeline. The ADL-X dataset is derived from the NTU RGB+D 120 dataset through the use of three techniques: Person Augmented Generation, Temporal Stitching, and Weakly Supervised Video Descriptions. The pipeline leverages CogVLM for frame-level caption generation and GPT-3.5 Turbo for summary synthesis and question-answer pair generation.
  • Figure 3: The multiple modalities of ADL-X. Left: Extraction of skeleton data features ($\mathcal{M}_s$) using SkeletonCLIP; Middle: Pipeline for extracting Human-Object Interaction (HOI) features through action-conditioned object detection, localization, and tracking; Right: Outline of our approach for obtaining $\mathcal{M}_m$ as QA and $\mathcal{M}_m$ as context.
  • Figure 4: MMPro training. Our proposed three-stage progressive training pipeline used to train LLAVIDAL. Stage 1 initializes independent projections for skeleton, object, and video features. Stage 2 combines skeleton and video modalities. Stage 3 integrates all modalities into the final model. Large hollow arrows indicate weight transfer between stages.
  • Figure 5: Qualitative results comparing LLAVIDAL with SOTA models. Incorrect descriptions are marked in red.
  • ...and 10 more figures
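
The staged curriculum described in the Figure 4 caption can be pictured as a small schedule that a training driver walks through. The following is only a sketch under stated assumptions, not the authors' implementation: the stage names, the `Stage` class, `run_schedule`, and `dummy_train` are all hypothetical, and the rule that every projector trained in an earlier stage is carried into the next stage that uses it is an assumption based on the caption's mention of weight transfer between stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One curriculum stage (hypothetical structure, for illustration only)."""
    name: str
    modalities: tuple  # modalities whose projections take part in this stage
    joint: bool        # False: each projection trained on its own; True: trained together


# Schedule read off the Figure 4 caption. Stage names are made up for this sketch.
MMPRO_SCHEDULE = [
    Stage("stage1_independent", ("video", "skeleton", "object"), joint=False),
    Stage("stage2_skeleton_video", ("video", "skeleton"), joint=True),
    Stage("stage3_all", ("video", "skeleton", "object"), joint=True),
]


def run_schedule(schedule, train_fn):
    """Walk the stages in order, handing each one the latest weights per modality.

    train_fn(stage_name, group, init) must return {modality: new_weights}.
    Assumption: weights from earlier stages are reused as initialisation.
    """
    weights = {}   # latest projection weights per modality
    history = []   # (stage name, modalities that started from inherited weights)
    for stage in schedule:
        inherited = {m: weights[m] for m in stage.modalities if m in weights}
        if stage.joint:
            groups = [stage.modalities]
        else:
            groups = [(m,) for m in stage.modalities]
        for group in groups:
            init = {m: inherited.get(m) for m in group}
            weights.update(train_fn(stage.name, group, init))
        history.append((stage.name, sorted(inherited)))
    return weights, history


def dummy_train(stage_name, group, init):
    """Placeholder trainer: tags each projection with the stage that last updated it."""
    return {m: f"{m}@{stage_name}" for m in group}


if __name__ == "__main__":
    final_weights, history = run_schedule(MMPRO_SCHEDULE, dummy_train)
    for name, inherited in history:
        print(name, "inherits:", inherited)
```

In a real setup `train_fn` would wrap an actual optimisation loop; here it only records provenance so the flow of weights between stages is easy to inspect.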