Understanding Human Activity with Uncertainty Measure for Novelty in Graph Convolutional Networks

Hao Xing, Darius Burschka

TL;DR

This work tackles boundary-sensitive human–object interaction recognition and segmentation under uncertainty by introducing an Uncertainty Quantified Temporal Fusion Graph Convolutional Network (UQ-TFGCN). The architecture combines an attention-based GCN encoder with a novel temporal fusion decoder, while Spectral Normalized Residual (SN-res) preserves distances in feature space to enhance out-of-distribution detection. Uncertainty is quantified using a multivariate Gaussian Process kernel over high-level features, enabling principled novelty scoring via marginal likelihoods. Experiments on Bimanual Actions and IKEA Assembly demonstrate improved boundary accuracy, segmentation performance, and robust OOD detection, albeit with increased computational demands, underscoring the method’s potential for safer human–robot interaction and online action understanding.

Abstract

Understanding human activity is a crucial aspect of developing intelligent robots, particularly in the domain of human-robot collaboration. Nevertheless, existing systems encounter challenges such as over-segmentation, attributed to errors in the up-sampling process of the decoder. In response, we introduce a promising solution: the Temporal Fusion Graph Convolutional Network. This innovative approach aims to rectify the inadequate boundary estimation of individual actions within an activity stream and mitigate the issue of over-segmentation in the temporal dimension. Moreover, systems leveraging human activity recognition frameworks for decision-making necessitate more than just the identification of actions. They require a confidence value indicative of the certainty regarding the correspondence between observations and training examples. This is crucial to prevent overly confident responses to unforeseen scenarios that were not part of the training data and may have resulted in mismatches due to weak similarity measures within the system. To address this, we propose the incorporation of a Spectral Normalized Residual connection aimed at enhancing efficient estimation of novelty in observations. This innovative approach ensures the preservation of input distance within the feature space by imposing constraints on the maximum gradients of weight updates. By limiting these gradients, we promote a more robust handling of novel situations, thereby mitigating the risks associated with overconfidence. Our methodology involves the use of a Gaussian process to quantify the distance in feature space.
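The distance-preservation idea behind the Spectral Normalized Residual connection can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes a single dense layer with ReLU on the residual branch, and the names `spectral_norm`, `normalize_weight`, `sn_residual`, and the bound `c` are hypothetical. If the residual branch has Lipschitz constant at most c < 1, then the map x + g(x) satisfies (1 - c)·‖x1 - x2‖ ≤ ‖f(x1) - f(x2)‖ ≤ (1 + c)·‖x1 - x2‖, so distances between inputs are neither collapsed nor blown up in feature space.

```python
import numpy as np

def spectral_norm(W, n_iter=100, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return float(u @ W @ v)

def normalize_weight(W, c=0.9):
    """Rescale W so its (estimated) spectral norm does not exceed c."""
    sigma = spectral_norm(W)
    return W * min(1.0, c / sigma)

def sn_residual(x, W, c=0.9):
    """Residual map x + ReLU(W_sn x); ReLU is 1-Lipschitz, so the branch is ~c-Lipschitz."""
    W_sn = normalize_weight(W, c)
    return x + np.maximum(W_sn @ x, 0.0)

# Illustrative check on random data (assumed sizes, not from the paper).
rng = np.random.default_rng(1)
W = rng.standard_normal((16, 16))
x1, x2 = rng.standard_normal(16), rng.standard_normal(16)
ratio = np.linalg.norm(sn_residual(x1, W) - sn_residual(x2, W)) / np.linalg.norm(x1 - x2)
print("output/input distance ratio:", ratio)
```

Because power iteration slightly underestimates the true spectral norm, the effective bound can sit marginally above `c`; choosing `c` with some margin below 1 keeps the lower bound positive.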

Paper Structure

This paper contains 19 sections, 12 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Dynamic relations (depicted by dashed arrows) exist between the body parts and objects involved in the action of drinking, with the human represented by a skeleton and objects represented by square boxes. The highlighted time bar signifies active participation in the interaction by either objects or skeleton joints.
  • Figure 2: Simplified graph representation of the human-object interaction, following [xing2022understanding]: (a) spatial graph with nodes (blue) and edges (orange) for an example from the Bimanual Actions dataset [datasetKIT]; (b) initial inward adjacency matrix with skeleton inward edges (orange block), empty human-object edges (red blocks), and object-object edges (blue block).
  • Figure 3: Framework of the temporal fusion decoder, comprising three blocks: temporal feature extractor, feature fusion, and classifier. The "Concate" block concatenates all feature maps from the temporal pyramid pooling (TPP) layers into one. "DS Conv" denotes a depth-wise 2D convolutional layer.
  • Figure 4: Structure of the Spatial-Temporal Graph Convolutional block. (a) A spatial graph convolutional layer with the attention map $\bm{A}_{att}$ and a trainable parameter $a$ for adaptive update of the adjacency matrix $\bm{A}$. (b) A basic block unit consisting of the spatial convolutional layer (see a), Batch Normalization, temporal layer, and ReLU activation function with a residual side branch.
  • Figure 5: The Gaussian Process (GP) kernel collects high-dimensional features from the network before the predictor and outputs predictions with associated probabilities.
  • ...and 5 more figures
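
To make the role of the GP in Figure 5 concrete, the following is a minimal sketch of how a Gaussian process over feature vectors yields a distance-aware uncertainty. It is a generic GP with an RBF kernel, not the paper's multivariate formulation; the function names, `length_scale`, `noise`, and the random "features" are all assumptions made for illustration. The predictive variance stays near zero for a query close to the training features and approaches the prior variance (1 for this kernel) for a query far from them, which is what makes it usable as a novelty score.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """RBF kernel matrix between rows of A and rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def gp_predictive_variance(train_feats, query_feats, noise=1e-3, length_scale=1.0):
    """GP posterior variance at each query: k(x,x) - k_s^T (K + noise*I)^-1 k_s."""
    n = len(train_feats)
    K = rbf_kernel(train_feats, train_feats, length_scale) + noise * np.eye(n)
    K_s = rbf_kernel(train_feats, query_feats, length_scale)
    L = np.linalg.cholesky(K)          # K = L L^T
    V = np.linalg.solve(L, K_s)        # V^T V = K_s^T K^-1 K_s
    return 1.0 - (V**2).sum(axis=0)    # k(x, x) = 1 for the RBF kernel

# Hypothetical high-level features: 20 training vectors in 4 dimensions.
rng = np.random.default_rng(0)
train = rng.standard_normal((20, 4))
queries = np.vstack([train[0], 10.0 * np.ones(4)])  # one seen, one far away
print(gp_predictive_variance(train, queries))
```

The sketch only works as a novelty score if the network's features preserve input distances, which is the motivation the abstract gives for the Spectral Normalized Residual connection.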