Hierarchical Activity Recognition and Captioning from Long-Form Audio

Peng Zhang, Qingyu Luo, Philip J. B. Jackson, Wenwu Wang

TL;DR

This work tackles the problem of understanding long-form audio with hierarchical semantics, introducing the MultiAct benchmark that provides three-level activity annotations (activity, sub-activity, event) along with fine-grained captions and high-level summaries. It contributes a unified hierarchical model built on an Auditory SlowFast backbone and a BART-based captioning decoder, enabling joint recognition and multi-resolution captioning. Extensive experiments across four tasks reveal strong baselines and highlight key challenges in modeling long-range, structured auditory semantics, especially boundary localization and procedural reasoning. Overall, MultiAct establishes a foundation for high-level activity understanding in naturalistic long-form audio and motivates future methods to capture complex temporal dependencies and hierarchical structure.

Abstract

Complex activities in real-world audio unfold over extended durations and exhibit hierarchical structure, yet most prior work focuses on short clips and isolated events. To bridge this gap, we introduce MultiAct, a new dataset and benchmark for multi-level structured understanding of human activities from long-form audio. MultiAct comprises long-duration kitchen recordings annotated at three semantic levels (activities, sub-activities and events) and paired with fine-grained captions and high-level summaries. We further propose a unified hierarchical model that jointly performs classification, detection, sequence prediction and multi-resolution captioning. Experiments on MultiAct establish strong baselines and reveal key challenges in modelling the hierarchical and compositional structure of long-form audio. A promising direction for future work is to explore methods better suited to capturing the complex, long-range relationships in long-form audio.

Paper Structure (10 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: The hierarchical structure of MultiAct and LLM-assisted annotation pipeline.
  • Figure 2: Overview of the proposed architecture.