OmniSapiens: A Foundation Model for Social Behavior Processing via Heterogeneity-Aware Relative Policy Optimization

Keane Ong, Sabri Boughorbel, Luwei Xiao, Chanakya Ekbote, Wei Dai, Ao Qu, Jingyao Wu, Rui Mao, Ehsan Hoque, Erik Cambria, Gianmarco Mengaldo, Paul Pu Liang

TL;DR

This work tackles the difficulty of training a single model across heterogeneous social-behavior tasks by introducing HARPO, a heterogeneity-aware relative policy optimization method that dynamically modulates on-policy advantages so that no single task or sample dominates updates. HARPO pairs a structured, geometry-based modulation with inertial smoothing to balance learning signals and, when used to train Omnisapiens-7B 2.0, yields strong multitask and zero-shot generalization performance across 10 behavioral tasks. Key contributions include a critic-free reasoning RL framework tailored for heterogeneous data and a foundation model that generalizes better to novel behavioral settings while producing richer, more interpretable reasoning traces. The results suggest that explicitly balancing heterogeneous learning signals can substantially improve unified social-behavior AI, with practical implications for robust, generalizable AI systems in real-world social contexts.

Abstract

To develop socially intelligent AI, existing approaches typically model human behavioral dimensions (e.g., affective, cognitive, or social attributes) in isolation. Although useful, task-specific modeling often increases training costs and limits generalization across behavioral settings. Recent reasoning RL methods facilitate training a single unified model across multiple behavioral tasks, but do not explicitly address learning across heterogeneous behavioral data. To address this gap, we introduce Heterogeneity-Aware Relative Policy Optimization (HARPO), an RL method that balances learning across heterogeneous tasks and samples. This is achieved by modulating advantages to ensure that no single task or sample carries disproportionate influence during policy optimization. Using HARPO, we develop and release Omnisapiens-7B 2.0, a foundation model for social behavior processing. Relative to existing behavioral foundation models, Omnisapiens-7B 2.0 achieves the strongest performance across behavioral tasks, with gains of up to +16.85% and +9.37% on multitask and held-out settings respectively, while producing more explicit and robust reasoning traces. We also validate HARPO against recent RL methods, where it achieves the most consistently strong performance across behavioral tasks.
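
To make the idea of advantage modulation concrete, the following is a minimal sketch and not the HARPO formulation, which is not given in this summary. It assumes GRPO-style group-standardized advantages, then rescales each task's advantages so that every task contributes the same mean absolute advantage, with the scaling factors smoothed by an exponential moving average as a stand-in for inertial control. The function names, the `inertia` parameter, and the equal-magnitude target are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All rollouts scored the same, so there is no relative signal.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def modulate_by_task(advantages, tasks, prev_factors=None, inertia=0.9, eps=1e-8):
    """Hypothetical task-level modulation (an assumption, not the paper's rule).

    Each task's advantages are rescaled toward a shared mean absolute
    magnitude. The per-task factor is blended with its previous value so
    that it changes gradually between training steps.
    """
    prev_factors = prev_factors or {}

    # Mean absolute advantage per task in the current batch.
    magnitudes = defaultdict(list)
    for adv, task in zip(advantages, tasks):
        magnitudes[task].append(abs(adv))
    task_mag = {task: mean(vals) for task, vals in magnitudes.items()}

    # Shared target: the average of the per-task magnitudes.
    target = mean(list(task_mag.values()))

    factors = {}
    for task, mag in task_mag.items():
        raw = target / (mag + eps)
        prev = prev_factors.get(task, 1.0)
        # Exponential moving average: higher inertia means slower change.
        factors[task] = inertia * prev + (1.0 - inertia) * raw

    modulated = [adv * factors[task] for adv, task in zip(advantages, tasks)]
    return modulated, factors


if __name__ == "__main__":
    advantages = [2.0, -2.0, 0.5, -0.5]
    tasks = ["HUM", "HUM", "SAR", "SAR"]
    modulated, factors = modulate_by_task(advantages, tasks, inertia=0.0)
    print(modulated, factors)
```

With `inertia=0.0` the factors take effect immediately and both tasks end up with the same mean absolute advantage; with a value closer to 1 the factors drift slowly from their previous values, so a single unusual batch cannot swing the balance between tasks.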

Paper Structure (34 sections, 33 equations, 7 figures, 7 tables, 1 algorithm)

This paper contains 34 sections, 33 equations, 7 figures, 7 tables, 1 algorithm.

Figures (7)

  • Figure 1: Sample count versus reasoning length in tokens. Green indicates correct predictions; red indicates incorrect ones. Compared with GRPO, HARPO induces more varied reasoning lengths on HUM (top) and SAR (bottom).
  • Figure 2: Example reasoning traces on the pragmatic tasks of humour detection (HUM, left) and sarcasm detection (SAR, right). HARPO produces more explicit and varied reasoning than GRPO, which defaults to minimal or no reasoning.
  • Figure 3: Reasoning traces on zero-shot generalization for autism behavioral detection (AUT) on the AVASD dataset. HumanOmniv2 tends to produce verbose reasoning, OmniSapiens RL tends to abstain from making predictions, and OmniSapiens-7B 2.0 generates concise reasoning traces that identify the most salient behavior.
  • Figure 4: Comparison of relative task advantage and rewards for SAR under HARPO and GRPO. After step 355 (dotted line), HARPO exhibits increasing relative task advantage alongside increasing rewards, while GRPO maintains lower relative task advantage with decreasing rewards.
  • Figure 5: Top: Comparison of the distribution of advantages between HARPO and the ablation without sample-level modulation, for the ANX task. Sample-level modulation narrows the advantage distribution, which can result in better performance for specific tasks (additional task distributions are in App. \ref{app:additional_results}). Middle & Bottom: Comparison of modulation factors between HARPO (middle) and the variant without inertial control (bottom). Inertial control enables a more gradual change in modulation factor values.
  • ...and 2 more figures