VCHAR: Variance-Driven Complex Human Activity Recognition framework with Generative Representation

Yuan Sun; Navid Salami Pargoo; Taqiya Ehsan; Zhao Zhang; Jorge Ortiz

TL;DR

This work tackles the challenge of complex human activity recognition under weak labeling in smart spaces by introducing VCHAR, a variance-driven framework that treats atomic outputs as distributions over time intervals and learns via a multitask objective that combines atomic and complex activity losses. A generative decoder, guided by a sensor-based foundation model and one-shot diffusion-based tuning, provides video-based explanations that are accessible to laypersons while LM/VLM components organize information for visualization. Across Opportunity, FallAllD, and Cooking Activity datasets, VCHAR achieves competitive complex-activity recognition (CHAR F1) while providing explainability that outperforms baselines in user studies. The approach reduces labeling requirements and enhances practical applicability in real-world smart environments, though real-time rendering and cross-domain integration remain areas for improvement.

Abstract

Complex human activity recognition (CHAR) remains a pivotal challenge within ubiquitous computing, especially in the context of smart environments. Existing studies typically require meticulous labeling of both atomic and complex activities, a task that is labor-intensive and prone to errors due to the scarcity and inaccuracies of available datasets. Most prior research has focused on datasets that precisely label either atomic activities or, at minimum, their sequence, approaches that are often impractical in real-world settings. In response, we introduce VCHAR (Variance-Driven Complex Human Activity Recognition), a novel framework that treats the outputs of atomic activities as a distribution over specified intervals. Leveraging generative methodologies, VCHAR elucidates the reasoning behind complex activity classifications through video-based explanations, accessible to users without prior machine learning expertise. Our evaluation across three publicly available datasets demonstrates that VCHAR enhances the accuracy of complex activity recognition without necessitating precise temporal or sequential labeling of atomic activities. Furthermore, user studies confirm that VCHAR's explanations are more intelligible than those of existing methods, facilitating a broader understanding of complex activity recognition among non-experts.
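
To make the idea of treating atomic outputs as a distribution over an interval more concrete, the following is a minimal sketch in Python with NumPy. It is not the paper's actual objective, which is not reproduced on this page. The function names, the choice of KL divergence for the atomic term, the cross-entropy for the complex term, and the weighting parameter `alpha` are all illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def interval_distribution(frame_logits):
    """Average per-frame atomic probabilities over one interval.

    frame_logits: array of shape (T, A), T time steps, A atomic activities.
    Returns a length-A probability vector (assumed aggregation rule).
    """
    return softmax(frame_logits, axis=-1).mean(axis=0)

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q), summed only over entries where p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / np.clip(q[mask], eps, None))))

def cross_entropy(logits, label, eps=1e-8):
    """Cross-entropy of a single class label against unnormalized logits."""
    return float(-np.log(softmax(logits)[label] + eps))

def multitask_loss(frame_logits, atomic_target, complex_logits, complex_label, alpha=0.5):
    """Hypothetical weighted sum of an atomic and a complex activity loss.

    atomic_target is an interval-level distribution over atomic activities,
    so no per-frame or sequence labels are needed (assumption for this sketch).
    """
    atomic_loss = kl_divergence(atomic_target, interval_distribution(frame_logits))
    complex_loss = cross_entropy(complex_logits, complex_label)
    return alpha * atomic_loss + (1.0 - alpha) * complex_loss

# Usage: a 24-step window, 5 atomic activities, 3 complex activities.
rng = np.random.default_rng(0)
frame_logits = rng.normal(size=(24, 5))
atomic_target = np.array([0.5, 0.5, 0.0, 0.0, 0.0])  # two atomic activities present
complex_logits = np.array([2.0, 0.1, -1.0])
loss = multitask_loss(frame_logits, atomic_target, complex_logits, complex_label=0)
```

In this sketch the atomic supervision only constrains which activities occur in the interval and in what proportion, not when they occur, which mirrors the weak-labeling setting the abstract describes.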

Paper Structure (32 sections, 8 equations, 17 figures, 4 tables)

Introduction
Challenges in Complex Human Activity Recognition
Contributions
Related work
Smart Space Complex Activity Recognition
Visual Representation of Sensor Data
Foundation and Multimodal Models
Research Methods
Outline of the VCHAR Framework
Multi-Task Learning for Complex Activity Recognition
Loss of Atomic Activity Recognition
Loss of Complex Activity Recognition
Sensor Encoder Architecture
Generative Modeling for Enhanced Complex Activity Representation
Pretraining a General-Purpose Sensor-Based Foundation Model Framework for CHAR
...and 17 more sections

Figures (17)

Figure 1: Standard complex activity datasets (left) typically provide detailed labels for each time interval to facilitate atomic activity training. In contrast, in-the-wild datasets (middle), constrained by labor capacity and other practical limitations, specify only the types of complex and atomic activities per segment, without specific time intervals or detailed atomic activity labels for time series segmentation. They often feature a greater variety of label combinations (right), reflecting the complexity and unpredictability of real-world scenarios.
Figure 2: The VCHAR framework is tailored for recognizing both atomic and complex activities without precise temporal annotations. The framework leverages sensor encoder outputs to generate visual representations at 24 fps, thereby enhancing user understanding. For instance, one video output depicts how the left foot sensor crucially detects activities. Another segment illustrates the atomic activity "open the dishwasher" as part of the broader "clean up" complex activity.
Figure 3: The VCHAR framework integrates various types of sensor data and utilizes an LLM agent to structure this information for a generative decoder. The framework is designed to enhance user comprehension by providing additional insights, such as sensor channel activation values, that offer a deeper understanding of the events occurring within the smart space.
Figure 4: For comparative analysis, the sensor encoder structure predominantly utilizes the ConvLSTM module. This encoder is designed to identify possible atomic activities and their associated complex activities occurring within the smart space.
Figure 5: The complex activity prompt learning decoder is engineered to master key elements including scenario descriptions, concept relationships, and detailed activity insights. This decoder is designed for adaptability, employing a one-shot tuning strategy to seamlessly integrate with specific datasets, thereby enhancing its versatility across various settings. The example shown in the graphs illustrates how this framework can be applied to newly identified activities, such as "clean the table," allowing the decoder to generate an intricate description of "making sandwich" tailored to the new dataset.
...and 12 more figures
