Meet JEANIE: a Similarity Measure for 3D Skeleton Sequences via Temporal-Viewpoint Alignment

Lei Wang, Jun Liu, Liang Zheng, Tom Gedeon, Piotr Koniusz

TL;DR

JEANIE tackles temporal-viewpoint misalignment in skeletal few-shot action recognition by introducing a joint temporal and viewpoint alignment measure that extends soft-DTW to a 4D alignment space across multiple simulated views. It couples an Encoding Network with GNN backbones to extract block-wise skeleton features and uses an extended transportation plan with a smooth 1-max shift to compute the JEANIE distance, selecting the optimal joint alignment path. The framework supports supervised FSAR, unsupervised dictionary-based FSAR, and four fusion strategies, achieving state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton, and UWA3D Multiview Activity II, with gains attributed to view-aware augmentation and joint alignment. The approach is designed to be robust to variation in viewpoint and action speed, which matters when adapting to novel actions from limited data.

Abstract

Video sequences exhibit significant nuisance variations (undesired effects) in the speed of actions, temporal locations, and subjects' poses, leading to temporal-viewpoint misalignment when comparing two sets of frames or evaluating the similarity of two sequences. Thus, we propose Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE) for sequence pairs. In particular, we focus on 3D skeleton sequences whose camera and subjects' poses can be easily manipulated in 3D. We evaluate JEANIE on skeletal Few-shot Action Recognition (FSAR), where matching temporal blocks (temporal chunks that make up a sequence) of support-query sequence pairs well (by factoring out nuisance variations) is essential due to the limited samples of novel classes. Given a query sequence, we create several views of it by simulating several camera locations. For a support sequence, we match it with the view-simulated query sequences, as in the popular Dynamic Time Warping (DTW). Specifically, each support temporal block can be matched to the query temporal block with the same or adjacent (next) temporal index, and with adjacent camera views, to achieve joint local temporal-viewpoint warping. JEANIE selects the smallest distance among matching paths with different temporal-viewpoint warping patterns, an advantage over DTW, which only performs temporal alignment. We also propose an unsupervised FSAR akin to clustering of sequences with JEANIE as a distance measure. JEANIE achieves state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II on supervised and unsupervised FSAR, and their meta-learning inspired fusion.
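
The joint warping described in the abstract can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not the authors' implementation: it uses a hard minimum rather than a soft-DTW style minimum, a single viewpoint axis instead of two, squared Euclidean distance between block features, and a hypothetical function name `joint_alignment_distance`. Each temporal step follows the usual DTW moves, and the view index may change by at most `max_view_step` per step.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def joint_alignment_distance(query_views, support, max_view_step=1):
    """Smallest accumulated distance over joint temporal-viewpoint paths.

    query_views[v][t] : feature of query block t under simulated view v
    support[tp]       : feature of support block tp
    Assumptions (not from the paper): hard min, one view axis.
    """
    n_views = len(query_views)
    n_q = len(query_views[0])
    n_s = len(support)
    acc = [[[math.inf] * n_s for _ in range(n_q)] for _ in range(n_views)]

    for t in range(n_q):
        for tp in range(n_s):
            for v in range(n_views):
                d = sq_dist(query_views[v][t], support[tp])
                if t == 0 and tp == 0:
                    # A path may start from any view at the first block pair.
                    acc[v][t][tp] = d
                    continue
                best = math.inf
                # Temporal moves as in DTW: advance query, support, or both.
                for dt, dtp in ((1, 0), (0, 1), (1, 1)):
                    pt, ptp = t - dt, tp - dtp
                    if pt < 0 or ptp < 0:
                        continue
                    # Only adjacent views are reachable in one step.
                    lo = max(0, v - max_view_step)
                    hi = min(n_views, v + max_view_step + 1)
                    for pv in range(lo, hi):
                        best = min(best, acc[pv][pt][ptp])
                acc[v][t][tp] = d + best

    # A path may finish at any view at the last block pair.
    return min(acc[v][n_q - 1][n_s - 1] for v in range(n_views))
```

With a single view (`len(query_views) == 1`) the function reduces to plain DTW, and with `max_view_step=0` it returns the best single-view DTW distance, so allowing view changes can only lower the result or leave it unchanged.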

Paper Structure (29 sections, 26 equations, 12 figures, 14 tables, 4 algorithms)

Figures (12)

  • Figure 1: Skeletal FSAR (simplified overview) takes episodes of query and support sequences, splits them into temporal blocks ($\mathbf{X}_1,...,\mathbf{X}_\tau$ and $\mathbf{X}'_1,...,\mathbf{X}'_{\tau'}$), passes them to the Encoding Network to obtain features $\mathbf{\Psi}=[\boldsymbol{\psi}_1,...,\boldsymbol{\psi}_\tau]$ and $\mathbf{\Psi}'=[\boldsymbol{\psi}'_1,...,\boldsymbol{\psi}'_{\tau'}]$, and feeds these to the Comparator, which typically uses some distance measure $d(\cdot,\cdot)$, regularization $\Omega$ and the similarity classifier $\ell(\cdot,\cdot)$.
  • Figure 2: One may use (top) stereo projections to simulate different camera views or simply use (bottom) Euler angles to rotate the 3D scene.
  • Figure 3: Our 3D skeleton-based FSAR with JEANIE. Frames from a query sequence and a support sequence are split into short-term temporal blocks $\mathbf{X}_1,...,\mathbf{X}_{\tau}$ and $\mathbf{X}'_1,...,\mathbf{X}'_{\tau'}$ of length $M$ given stride $S$. Subsequently, we generate either (i) multiple rotations of each query skeleton by $(\Delta\theta_x,\Delta\theta_y)$ using Euler angles (baseline approach) or (ii) simulated camera views (gray cameras) obtained by camera shifts $(\Delta\theta_{az},\Delta\theta_{alt})$ w.r.t. the assumed average camera location (black camera). We pass all skeletons via the Encoding Network (with an optional transformer) to obtain feature tensors $\boldsymbol{\Psi}$ and $\boldsymbol{\Psi}'$, which are directed to JEANIE. We note that the temporal-viewpoint alignment takes place in 4D space (we show a 3D case with three views: $-30^\circ, 0^\circ, 30^\circ$). Temporally, JEANIE starts from the same $t\!=\!(1,1)$ and finishes at $t\!=\!(\tau,\tau')$ (as in DTW). Viewpoint-wise, JEANIE starts from every possible camera shift $\Delta\theta\in\{-30^\circ, 0^\circ, 30^\circ\}$ (we do not know the true correct pose) and finishes at one of the possible camera shifts. At each step, the path may move by no more than $(\pm\!\Delta\theta_{az},\pm\!\Delta\theta_{alt})$ to prevent erroneous alignments. Finally, SoftMin selects the smallest distance.
  • Figure 4: (top) In viewpoint-invariant learning, the distance between query features $\boldsymbol{\Psi}$ and support features $\boldsymbol{\Psi}'$ has to be computed. The blue arrow indicates that trajectories of both actions need alignment. (bottom) In real life, a subject's 3D body joints deviate from one ideal trajectory, and so a more advanced viewpoint alignment strategy is needed.
  • Figure 5: Euclidean dist. vs. DTW. (top) Feature vectors $\boldsymbol{\psi}_t$ and $\boldsymbol{\psi}'_{t}$ of query and support frames (or temp. blocks) are matched along time $t$: $d_{Euclid}(\mathbf{\Psi},\mathbf{\Psi}')\!=\!\sum_t d^2(\boldsymbol{\psi}_t, \boldsymbol{\psi}'_{t})$. (bottom) For DTW, a path with minimum aggregated distance is selected as $d_{DTW}(\mathbf{\Psi},\mathbf{\Psi}')\!=\!\sum_t d^2(\boldsymbol{\psi}_{m(t)}, \boldsymbol{\psi}'_{n(t)})$, where $m(t)$ and $n(t)$ parameterize query and support indexes. Steps $\downarrow$, $\searrow$, $\rightarrow$ are permitted in the graph. We expect $d_{DTW}\leq d_{Euclid}$.
  • ...and 7 more figures
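
The preprocessing described in the Figure 3 caption (temporal blocks of length $M$ with stride $S$, then several rotated copies of each query skeleton) might be sketched as follows. This is a minimal illustration under assumptions, not the paper's code: the function names are hypothetical, trailing frames that do not fill a whole block are dropped, rotation is applied about the x axis first and then the y axis, and the default angles reuse the $-30^\circ, 0^\circ, 30^\circ$ example from the caption.

```python
import math


def temporal_blocks(frames, M, S):
    """Split a sequence of frames into blocks of length M taken every S frames.

    Assumption: frames left over at the end that do not fill a block are dropped.
    """
    return [frames[i:i + M] for i in range(0, len(frames) - M + 1, S)]


def rotate_xy(joint, theta_x_deg, theta_y_deg):
    """Rotate one 3D joint about the x axis, then about the y axis.

    The rotation order is an assumption made for this sketch.
    """
    x, y, z = joint
    ax = math.radians(theta_x_deg)
    ay = math.radians(theta_y_deg)
    # Rotation about the x axis changes y and z.
    y, z = (y * math.cos(ax) - z * math.sin(ax),
            y * math.sin(ax) + z * math.cos(ax))
    # Rotation about the y axis changes x and z.
    x, z = (x * math.cos(ay) + z * math.sin(ay),
            -x * math.sin(ay) + z * math.cos(ay))
    return (x, y, z)


def simulate_views(block, angles_deg=(-30, 0, 30)):
    """Return rotated copies of a block, keyed by (theta_x, theta_y) in degrees.

    A block is a list of frames; a frame is a list of (x, y, z) joints.
    """
    return {
        (tx, ty): [[rotate_xy(j, tx, ty) for j in frame] for frame in block]
        for tx in angles_deg
        for ty in angles_deg
    }
```

With three angles per axis this yields nine views per block. Each view would then be passed through the encoder to give the per-view, per-block features that the alignment step compares.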