HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Kunyu Peng; Junchao Huang; Xiangsheng Huang; Di Wen; Junwei Zheng; Yufan Chen; Kailun Yang; Jiamin Wu; Chongqing Hao; Rainer Stiefelhagen

TL;DR

This work defines Referring Human Action Segmentation (RHAS) to enable textual guidance for segmenting actions of a specific person in multi-person untrimmed videos. It introduces RHAS133, a large-scale dataset with fine-grained actions and referring expressions, and shows that existing methods struggle in this setting. To address this, it proposes HopaDIFF, a diffusion-based framework with a holistic-partial two-branch design, HP-xLSTM cross-input gate attention, and Fourier-domain conditioning to improve temporal reasoning and controllability. The approach achieves state-of-the-art results on RHAS133 across diverse evaluation scenarios, highlighting the value of combining global and local cues with frequency-aware diffusion for language-guided video understanding. The work lays a foundation for practical, language-guided analysis of complex social scenes and multi-person activities.

Abstract

Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and comprising 33 hours of video annotated with 137 fine-grained actions, together with textual descriptions for this new task. Benchmarking existing action segmentation methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, which leverages a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition that provides finer-grained control over the generation of action segmentation results. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The dataset and code are available at https://github.com/KPeng9510/HopaDIFF.
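
To make the task output concrete, the following minimal sketch shows how per-frame action labels for one referred person could be collapsed into labeled temporal segments. It is an illustration only and is not taken from the HopaDIFF code: the function name `frames_to_segments`, the example labels, and the assumption of exactly one label per frame are all hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def frames_to_segments(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse per-frame labels into (label, start_frame, end_frame) runs.

    Frame indices are 0-based and end_frame is inclusive.
    Assumes exactly one label per frame (an assumption of this sketch).
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        segments.append((label, start, start + length - 1))
        start += length
    return segments


# Hypothetical per-frame predictions for the person selected by a text query.
frame_labels = ["walk", "walk", "sit", "sit", "sit", "talk"]
segments = frames_to_segments(frame_labels)
```

In the referring setting described above, a list like `frame_labels` would be produced separately for each textual query, since different descriptions select different people in the same video.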

