HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios

Kunyu Peng, Junchao Huang, Xiangsheng Huang, Di Wen, Junwei Zheng, Yufan Chen, Kailun Yang, Jiamin Wu, Chongqing Hao, Rainer Stiefelhagen

TL;DR

This work defines Referring Human Action Segmentation (RHAS) to enable textual guidance for segmenting actions of a specific person in multi-person untrimmed videos. It introduces RHAS133, a large-scale dataset with fine-grained actions and referring expressions, and shows that existing methods struggle in this setting. To address this, it proposes HopaDIFF, a diffusion-based framework with a holistic-partial two-branch design, HP-xLSTM cross-input gate attention, and Fourier-domain conditioning to improve temporal reasoning and controllability. The approach achieves state-of-the-art results on RHAS133 across diverse evaluation scenarios, highlighting the value of combining global and local cues with frequency-aware diffusion for language-guided video understanding. The work lays a foundation for practical, language-guided analysis of complex social scenes and multi-person activities.

Abstract

Action segmentation is a core challenge in high-level video understanding, aiming to partition untrimmed videos into segments and assign each a label from a predefined action set. Existing methods primarily address single-person activities with fixed action sequences, overlooking multi-person scenarios. In this work, we pioneer textual reference-guided human action segmentation in multi-person settings, where a textual description specifies the target person for segmentation. We introduce the first dataset for Referring Human Action Segmentation, i.e., RHAS133, built from 133 movies and comprising 33 hours of video annotated with 137 fine-grained actions, together with textual descriptions for this new task. Benchmarking existing action segmentation methods on RHAS133 using VLM-based feature extractors reveals limited performance and poor aggregation of visual cues for the target person. To address this, we propose a holistic-partial aware Fourier-conditioned diffusion framework, i.e., HopaDIFF, which leverages a novel cross-input gate attentional xLSTM to enhance holistic-partial long-range reasoning and a novel Fourier condition to provide finer-grained control over the generated action segmentation. HopaDIFF achieves state-of-the-art results on RHAS133 in diverse evaluation settings. The dataset and code are available at https://github.com/KPeng9510/HopaDIFF.
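
To make the task definition concrete, the following minimal sketch shows how per-frame labels for one referred person could be collapsed into labeled segments with start and end frame indices. It is an illustration only, not the authors' code: the function name `frames_to_segments`, the example labels, and the assumption of a single label per frame are all hypothetical, and the "N/A" marker simply stands for frames in which the referred person is absent.

```python
from itertools import groupby


def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (label, start, end) runs.

    Frame indices are 0-based and `end` is inclusive.
    Assumes exactly one label per frame (an illustrative simplification).
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = sum(1 for _ in run)  # number of consecutive frames with this label
        segments.append((label, start, start + length - 1))
        start += length
    return segments


# Hypothetical per-frame labels for one referred person; "N/A" marks absence.
labels = ["walk", "walk", "N/A", "N/A", "N/A", "sit", "sit"]
segments = frames_to_segments(labels)
for label, start, end in segments:
    print(f"{label}: frames {start}-{end}")
```

A segmentation model in this setting would produce the per-frame labels; the grouping step above only converts them into the segment form used for annotation and evaluation.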

Paper Structure

This paper contains 22 sections, 11 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: An illustration of the referring human action segmentation task is shown. The example is taken from the RHAS133 dataset, with annotations provided for two referred individuals. The numbers beneath each annotation indicate the frame indices marking the start and end of each action. N/A denotes that the referred person is not present within the corresponding frame interval.
  • Figure 2: An illustration of the statistics of the dataset. The figure on the left-hand side shows the number of frames per action category, and the figure on the right is the word cloud generated based on the textual annotation in our RHAS133 dataset. Zoom in for a better view.
  • Figure 3: An overview of the proposed HopaDIFF, which integrates two complementary diffusion-based branches, i.e., holistic and partial branches for action segmentation with target-referenced awareness. To improve controllability and segmentation precision, we introduce HP-xLSTM, a cross-input gated module designed for effective exchange between holistic and partial features, and propose a novel Fourier-based conditioning mechanism to inject frequency-domain control signals into the generative process. During training, the two branches are individually supervised using ground-truth action labels and temporal boundary annotations.
  • Figure 4: An overview of the statistics regarding the number of persons per movie in our RHAS133 dataset. The horizontal axis denotes the video ID, and the vertical axis denotes the number of persons annotated in the corresponding video.
  • Figure 5: Qualitative results of our HopaDIFF and the FACT baseline, where false predictions are marked in green and correct predictions in blue. Each color shown in GT denotes a different combination of atomic-level actions.
  • ...and 1 more figure