VCHAR: Variance-Driven Complex Human Activity Recognition framework with Generative Representation

Yuan Sun, Navid Salami Pargoo, Taqiya Ehsan, Zhao Zhang, Jorge Ortiz

TL;DR

This work tackles the challenge of complex human activity recognition under weak labeling in smart spaces by introducing VCHAR, a variance-driven framework that treats atomic outputs as distributions over time intervals and learns via a multitask objective that combines atomic and complex activity losses. A generative decoder, guided by a sensor-based foundation model and one-shot diffusion-based tuning, provides video-based explanations that are accessible to laypersons while LM/VLM components organize information for visualization. Across Opportunity, FallAllD, and Cooking Activity datasets, VCHAR achieves competitive complex-activity recognition (CHAR F1) while providing explainability that outperforms baselines in user studies. The approach reduces labeling requirements and enhances practical applicability in real-world smart environments, though real-time rendering and cross-domain integration remain areas for improvement.

Abstract

Complex human activity recognition (CHAR) remains a pivotal challenge within ubiquitous computing, especially in the context of smart environments. Existing studies typically require meticulous labeling of both atomic and complex activities, a task that is labor-intensive and prone to errors due to the scarcity and inaccuracies of available datasets. Most prior research has focused on datasets that either precisely label atomic activities or, at minimum, label their sequence; both approaches are often impractical in real-world settings. In response, we introduce VCHAR (Variance-Driven Complex Human Activity Recognition), a novel framework that treats the outputs of atomic activities as a distribution over specified intervals. Leveraging generative methodologies, VCHAR elucidates the reasoning behind complex activity classifications through video-based explanations that are accessible to users without prior machine learning expertise. Our evaluation across three publicly available datasets demonstrates that VCHAR enhances the accuracy of complex activity recognition without requiring precise temporal or sequential labeling of atomic activities. Furthermore, user studies confirm that VCHAR's explanations are more intelligible than those of existing methods, facilitating a broader understanding of complex activity recognition among non-experts.
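
To make the idea of "atomic outputs as a distribution over an interval" combined with a multitask objective more concrete, the following is a minimal sketch and not the authors' implementation. It assumes the atomic term is a KL divergence between a target distribution and the interval-averaged predicted distribution, that the complex-activity term is a cross-entropy, and that the two are mixed by a hypothetical weight `alpha`; none of these choices, nor any of the function names, are confirmed by the text above.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def interval_distribution(frame_logits):
    """Average per-frame atomic probabilities over one interval.

    Assumption: the interval-level distribution is a simple mean of
    per-frame softmax outputs; no per-frame labels are needed.
    """
    probs = [softmax(f) for f in frame_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), skipping zero-mass target entries."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

def cross_entropy(logits, label):
    """Cross-entropy of a single integer label against raw logits."""
    return -math.log(softmax(logits)[label])

def multitask_loss(frame_logits, atomic_target, complex_logits,
                   complex_label, alpha=0.5):
    """Hypothetical combined loss: alpha * atomic + (1 - alpha) * complex."""
    atomic = kl_divergence(atomic_target, interval_distribution(frame_logits))
    complex_ = cross_entropy(complex_logits, complex_label)
    return alpha * atomic + (1 - alpha) * complex_, atomic, complex_

# Usage: two frames, three atomic classes, two complex classes.
total, atomic, complex_ = multitask_loss(
    frame_logits=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    atomic_target=[1 / 3, 1 / 3, 1 / 3],
    complex_logits=[0.0, 0.0],
    complex_label=0,
)
```

In this sketch the atomic target only states how much of the interval each atomic activity occupies, which is why no per-frame or ordered labels appear anywhere in the loss.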

Paper Structure (32 sections, 8 equations, 17 figures, 4 tables)

Figures (17)

  • Figure 1: Standard complex activity datasets (left) typically provide detailed labels for each time interval to facilitate atomic activity training. In contrast, in-the-wild datasets (middle), constrained by labor capacity and other practical limitations, only specify the types of complex and atomic activities per segment, without specific time intervals or detailed atomic activity labels for time series segmentation. They often feature a greater variety of label combinations (right), reflecting the complexity and unpredictability of real-world scenarios.
  • Figure 2: The VCHAR framework is tailored for recognizing both atomic and complex activities without precise temporal annotations. The framework leverages sensor encoder outputs to generate visual representations at 24 fps, thereby enhancing user understanding. For instance, one video output depicts how the left foot sensor crucially detects activities. Another segment illustrates the atomic activity "open the dishwasher" as part of the broader "clean up" complex activity.
  • Figure 3: The VCHAR framework integrates various types of sensor data and utilizes an LLM agent to structure this information for a generative decoder. The framework is designed to enhance user comprehension by providing additional insights, such as sensor channel activation values, that offer a deeper understanding of the events occurring within the smart space.
  • Figure 4: For comparative analysis, the sensor encoder structure predominantly utilizes the ConvLSTM module. This encoder is designed to identify possible atomic activities and their associated complex activities occurring within the smart space.
  • Figure 5: The complex activity prompt learning decoder is engineered to master key elements including scenario descriptions, concept relationships, and detailed activity insights. This decoder is designed for adaptability, employing a one-shot tuning strategy to seamlessly integrate with specific datasets, thereby enhancing its versatility across various settings. The example shown in the graphs illustrates how this framework can be applied to newly identified activities, such as "clean the table," allowing the decoder to generate an intricate description of "making sandwich" tailored to the new dataset.
  • ...and 12 more figures