Collaborative Temporal Feature Generation via Critic-Free Reinforcement Learning for Cross-User Sensor-Based Activity Recognition

Xiaozhou Ye; Feng Jiang; Zihan Wang; Xiulai Wang; Yutao Zhang; Kevin I-Kai Wang

Abstract

Human Activity Recognition (HAR) using wearable inertial sensors is foundational to healthcare monitoring, fitness analytics, and context-aware computing, yet its deployment is hindered by cross-user variability arising from heterogeneous physiological traits, motor habits, and sensor placements. Existing domain generalization approaches either neglect temporal dependencies in sensor streams or depend on impractical target-domain annotations. We propose a different paradigm: modeling generalizable feature extraction as a collaborative sequential generation process governed by reinforcement learning. Our framework, CTFG (Collaborative Temporal Feature Generation), employs a Transformer-based autoregressive generator that incrementally constructs feature token sequences, each token conditioned on prior context and the encoded sensor input. The generator is optimized via Group-Relative Policy Optimization (GRPO), a critic-free algorithm that evaluates each generated sequence against a cohort of alternatives sampled from the same input, deriving advantages through intra-group normalization rather than learned value estimation. This design eliminates the distribution-dependent bias inherent in critic-based methods and provides self-calibrating optimization signals that remain stable across heterogeneous user distributions. A tri-objective reward comprising class discrimination, cross-user invariance, and temporal fidelity jointly shapes the feature space to separate activities, align user distributions, and preserve fine-grained temporal content. Evaluations on the DSADS and PAMAP2 benchmarks demonstrate state-of-the-art cross-user accuracy (88.53% and 75.22%, respectively), substantial reduction in inter-task training variance, accelerated convergence, and robust generalization under varying action-space dimensionalities.

Paper Structure (38 sections, 1 theorem, 13 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Related work
Generalizable representations for wearable sensor HAR
Policy optimization: from critic-based to critic-free methods
Temporal modeling for sensor data
Methodology
Problem setting and notation
Feature extraction as a Markov decision process
Autoregressive feature generation architecture
Temporal encoder
Autoregressive feature decoder
Group-Relative Policy Optimization
Limitations of critic-based advantage estimation
GRPO formulation
Formal analysis
...and 23 more sections

Key Result

Proposition 1

Let $R^{(1)}, \dots, R^{(G)}$ be i.i.d. reward samples. The group-relative advantage $\hat{A}^{(g)}$ in Eq. eq:grpo_advantage satisfies: (i) $\mathbb{E}[\hat{A}^{(g)}] = 0$ (zero-centered); (ii) $\text{Var}(\hat{A}^{(g)}) \to 1$ as $G \to \infty$; (iii) Affine invariance: $\hat{A}^{(g)}(aR + b) = \h
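The zero-centering and affine-invariance properties can be checked numerically on a single group. The sketch below assumes the common group-relative form $\hat{A}^{(g)} = (R^{(g)} - \text{mean}) / (\text{std} + \epsilon)$, which this excerpt references but does not reproduce. The function name, the value of `eps`, the use of the population standard deviation, and the sampled rewards are all illustrative assumptions, not the authors' implementation. It checks the in-sample counterparts of (i) and (iii), with a positive scale factor assumed for the affine map.

```python
import random
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (R - mean) / (std + eps).

    Assumed form of the group-relative advantage; eps is a hypothetical
    stabilizer that guards against a zero standard deviation.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the G samples
    return [(r - mean) / (std + eps) for r in rewards]


random.seed(0)
# G = 8 hypothetical rewards for candidates sampled from the same input
rewards = [random.gauss(0.5, 0.2) for _ in range(8)]
adv = group_relative_advantages(rewards)

# In-sample zero-centering: the advantages of one group average to ~0
mean_adv = statistics.fmean(adv)

# Affine invariance with a > 0: rescaling and shifting the rewards
# leaves the advantages unchanged (up to the small effect of eps)
shifted = [3.0 * r + 10.0 for r in rewards]
adv_shifted = group_relative_advantages(shifted)
max_gap = max(abs(x - y) for x, y in zip(adv, adv_shifted))

print(f"mean advantage: {mean_adv:.2e}")
print(f"max gap after affine transform: {max_gap:.2e}")
```

Because the normalization uses only statistics of the group itself, no learned value function is needed to produce the baseline. This is the sense in which the abstract calls the signal self-calibrating.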

Figures (6)

Figure 1: Overview of the proposed CTFG framework. During training, the autoregressive generator produces $G$ candidate feature sequences per input, evaluated by a tri-objective reward ($R_{\text{cls}}$, $R_{\text{inv}}$, $R_{\text{tmp}}$) and optimized via group-relative advantages $\hat{A}^{(g)}$ without a value function. During inference, a single deterministic forward pass (using predicted means $\mu_{i,j}$ only) generates features, which are flattened into $\tilde{z}_i = \text{vec}(z_i)$ and classified by Logistic Regression to produce the predicted label $\hat{y}_i$.
Figure 2: Group-relative advantage in the latent space. For each input, $G$ sampled feature sequences are compared against their group mean. Sequences achieving above-average reward receive positive advantages and are reinforced; below-average sequences are suppressed.
Figure 3: Convergence comparison on PAMAP2 (epochs 0--100). Shaded regions indicate $\pm$1 standard deviation across leave-one-group-out configurations.
Figure 4: Convergence comparison on DSADS (epochs 0--100). GRPO converges faster with monotonically decreasing variance, while PPO exhibits mid-training instability.
Figure 5: Token count sensitivity on DSADS. GRPO maintains stable accuracy with decreasing variance, while PPO collapses at high token counts.
...and 1 more figures

Theorems & Definitions (2)

Remark 1: Why RL over supervised optimization
Proposition 1: Properties of group-relative advantage
